### What this PR does / why we need it?
Initial version to support minimax-m2.5 on vllm-ascend.
This commit converts the original fp8 weights to quantized bf16 to
support Minimax-m2.5 on NPU.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
### Test Report
Self-tested precision summary. For reference, the official AIME2025
score is 86.3.
<img width="426" height="84" alt="image"
src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a"
/>
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Supports a contiguous-tensor hybrid-attention KV cache for hybrid
full-attention/Mamba models, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv , ...]
tensor2: [k , ssm , ...]
tensor3: [v , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.
This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.
### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.
### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
co-authored-by: shen-shanshan <467638484@qq.com>
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Add basic 310p support. For now, only dense models in eager mode work.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Revert PR 5253 to fix the smoke-test failure.
### Does this PR introduce _any_ user-facing change?
Does not.
### How was this patch tested?
It was tested in the failure case.
Signed-off-by: Rifa <865071616@qq.com>
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
Fix the bug in the PCP overlay feature
1. Fix the bug related to PCP and EPLB overlap by including the PCP size
in the world_size calculation.
2. In the PCP pooling scenario, add a prompt for setting
cp_kv_cache_interleave_size.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### Motivation.
**Limitations of the current vLLM v1 scheduling strategy**
vLLM v1 scheduling currently enables chunked prefill by default, which
processes prefill and decode requests together in a single scheduling
step. In some scenarios this hurts overall system throughput and
performance.
Balance scheduling addresses this issue by synchronizing the number of
running queues across all schedulers to delay the scheduling of new
requests, thereby improving the overall system's steady-state decoding
time. This achieves:
✅ Adding `balance_gather` to the scheduler synchronizes the number of
requests in the running queues between DPs.
✅ Balance scheduling improves the decode steady-state time, thereby
increasing the overall output throughput of the inference system.
### Proposed Change.
**1.Feature Overview**
In the vLLM scheduler, running requests (i.e., requests that have already
been prefilled and are being computed) have the highest priority, followed
by waiting requests (i.e., requests that have not yet been computed).
As shown in the diagram above, when the inference system leaves its
steady state, the scheduler schedules a batch of new requests for prefill
and then synchronizes across the data-parallel (DP) ranks. This can force
DP ranks that are running only decode to synchronize on the number of
prefill tokens. Frequent prefill scheduling by some DP ranks can
therefore degrade the overall output throughput of the system.
Balance scheduling synchronizes the number of running-queue requests
across the DP ranks and schedules new requests for prefill only when
every scheduler has fewer than max_num_request running requests.
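The gating rule described above can be sketched as follows. This is a minimal illustration, not the code from this proposal: the function name `can_schedule_prefill`, the parameter `max_num_reqs`, and the plain list that stands in for the gathered running-queue sizes are all assumptions.

```python
from typing import List


def can_schedule_prefill(running_counts: List[int], max_num_reqs: int) -> bool:
    """Return True only if every DP scheduler still has room for a request.

    running_counts is a hypothetical stand-in for the gathered
    running-queue sizes, one entry per data-parallel scheduler.
    """
    return all(count < max_num_reqs for count in running_counts)


# One scheduler is full, so no rank starts a new prefill in this step.
print(can_schedule_prefill([128, 96, 100, 127], max_num_reqs=128))  # False
# Every scheduler has spare capacity, so new requests may be prefilled.
print(can_schedule_prefill([120, 96, 100, 127], max_num_reqs=128))  # True
```

Holding back prefill until all ranks have capacity keeps ranks that are only decoding from being stalled by a prefill that a single rank decided to start.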
**2.Implementation Design**
**3.Experiment Results**
- Fixed-length input scenario: In the performance test scenario with
3.5K fixed-length input and 1.5K fixed-length output, the throughput
performance was improved by approximately **18%** after adding balance
scheduling.
| Method | Model | Input Len | Request Count | Output Len | BatchSize | Average TTFT | Average TPOT | e2e duration | Input Token Throughput | Output Token Throughput | Request Throughput |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 6600 | 86.85 | 591.9s | 3030.5 | 1297.3 | 0.86 |
| Balance scheduling | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 7012 | 70.63 | 501.7s | 3575.7 | 1530.7 | 1.02 |
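The roughly 18% figure can be checked directly from the table. The short calculation below uses only the Output Token Throughput and Average TPOT values reported in the two rows; the variable names are chosen here for illustration.

```python
# Values copied from the Baseline and Balance scheduling rows.
baseline_output_tps = 1297.3
balanced_output_tps = 1530.7
baseline_tpot = 86.85
balanced_tpot = 70.63

# Relative change with respect to the baseline.
throughput_gain = (balanced_output_tps - baseline_output_tps) / baseline_output_tps
tpot_reduction = (baseline_tpot - balanced_tpot) / baseline_tpot

print(f"Output throughput gain: {throughput_gain:.1%}")  # 18.0%
print(f"Average TPOT reduction: {tpot_reduction:.1%}")   # 18.7%
```

Note that Average TTFT rises from 6600 to 7012 in the same comparison, which is consistent with new requests waiting longer before prefill is scheduled.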
**4.Demo PR**
[#29721](https://github.com/vllm-project/vllm/pull/29721)
---------
Signed-off-by: GDzhu01 <809721801@qq.com>
We decided to release v0.13.0 soon. So no need to support 0.12.0 now.
Let's drop it.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR integrates suffix decoding (https://arxiv.org/abs/2411.04975)
from vllm (https://github.com/vllm-project/vllm/pull/25784).
Suffix Decoding is a dynamic n-gram matching method that:
1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, and FIFO
eviction of older requests.
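To make the idea concrete, here is a toy sketch of frequency-based suffix matching. It is not the arctic-inference implementation: it uses a dictionary of n-gram continuations rather than a real suffix tree, it omits FIFO eviction, and the class name `ToySuffixSpeculator` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class ToySuffixSpeculator:
    """Proposes draft tokens by matching the context suffix against history."""

    def __init__(self, max_suffix_len: int = 3) -> None:
        self.max_suffix_len = max_suffix_len
        # Maps an n-gram (tuple of tokens) to counts of the token that followed it.
        self.next_counts: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)

    def update(self, tokens: List[int]) -> None:
        """Record every n-gram -> next-token transition seen in a response."""
        for i in range(1, len(tokens)):
            for n in range(1, self.max_suffix_len + 1):
                if i - n < 0:
                    break
                self.next_counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def speculate(self, context: List[int], num_tokens: int) -> List[int]:
        """Greedily extend the context using the longest matching suffix."""
        draft: List[int] = []
        window = list(context)
        for _ in range(num_tokens):
            for n in range(min(self.max_suffix_len, len(window)), 0, -1):
                counts = self.next_counts.get(tuple(window[-n:]))
                if counts:
                    token = counts.most_common(1)[0][0]
                    draft.append(token)
                    window.append(token)
                    break
            else:
                break  # No suffix matched, so stop speculating.
        return draft


spec = ToySuffixSpeculator()
spec.update([1, 2, 3, 4, 5])  # History from a previous response.
print(spec.speculate([9, 2, 3], num_tokens=5))  # [4, 5]
```

The draft stops early when no suffix of the context has been seen before, which is why repetitive workloads benefit most from this kind of speculation.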
### Does this PR introduce _any_ user-facing change?
This feature is opt-in and remains seamless for users who do not require
suffix speculative decoding.
Users who wish to enable it must first install arctic-inference:
`pip install arctic-inference`
After installation, suffix speculative decoding can be enabled with the
following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'`
### How was this patch tested?
This PR is currently being tested on vLLM
main:83f478bb19
with PR https://github.com/vllm-project/vllm/pull/25784
In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.
- vLLM version: v0.11.2
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
### What this PR does / why we need it?
Currently, the code determines the chip type in two different ways:
`get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, and
`is_310p` uses `_build_info.__soc_version__`, which is generated at
install time. We need to unify the two paths.
The unification is based on the following points:
1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.
Based on the above points, we have made the following changes:
1. When the user sets the `SOC_VERSION` environment variable, use it;
otherwise, query soc_version with `npu-smi`;
2. Generate device_type from soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` to `_build_info.py`;
3. In running state, use `__device_type__` to choose the code branch.
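The resolution order above can be sketched as follows. This is an illustration under stated assumptions, not the code in `setup.py`: the function names are hypothetical, the `npu-smi` query is injected as a callable so the example runs anywhere, and the two table entries are examples only rather than the real contents of `soc_to_device`.

```python
from typing import Callable, Mapping

# Illustrative entries only; the real `soc_to_device` table lives in setup.py
# and its contents are not reproduced here.
SOC_TO_DEVICE_EXAMPLE = {
    "ascend910b1": "910B",
    "ascend310p3": "310P",
}


def resolve_soc_version(env: Mapping[str, str],
                        query_npu_smi: Callable[[], str]) -> str:
    """Prefer the SOC_VERSION env var; otherwise fall back to the query."""
    soc_version = env.get("SOC_VERSION")
    if not soc_version:
        soc_version = query_npu_smi()
    return soc_version


def to_device_type(soc_version: str, table: Mapping[str, str]) -> str:
    """Map a SoC version to the coarse device type used for branching."""
    key = soc_version.lower()
    if key not in table:
        raise ValueError(f"Unsupported SOC_VERSION: {soc_version}")
    return table[key]


# The env var wins over the (stubbed) npu-smi query.
soc = resolve_soc_version({"SOC_VERSION": "ASCEND910B1"}, lambda: "ascend310p3")
print(to_device_type(soc, SOC_TO_DEVICE_EXAMPLE))  # 910B
```

Because the device type is written at build time, the running process does not need torch_npu to decide which branch to take.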
### Does this PR introduce _any_ user-facing change?
When the `SOC_VERSION` environment variable is not set, it no longer
defaults to `ASCEND910B1`; soc_version is queried with `npu-smi` instead.
In addition, `SOC_VERSION` must be one of the entries in `soc_to_device`
in `setup.py`.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Fix a bug caused by this pr:
https://github.com/vllm-project/vllm-ascend/pull/4223
The bug causes
vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py to be
applied incorrectly.
### How was this patch tested?
Tested on a single node. When the environment variable DYNAMIC_EPLB is
set to true, the patch is applied correctly. When it is set to false,
the patch is not applied.
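The tested behavior amounts to gating the patch on an environment variable. The sketch below shows one way to express that; the helper names are hypothetical, and treating only the literal value `true` (case-insensitive) as enabled is an assumption, not something this PR specifies.

```python
from typing import Callable, Mapping


def dynamic_eplb_enabled(env: Mapping[str, str]) -> bool:
    """Assumption: only "true" (case-insensitive) enables the feature."""
    return env.get("DYNAMIC_EPLB", "false").strip().lower() == "true"


def maybe_apply_patch(env: Mapping[str, str],
                      apply_patch: Callable[[], None]) -> bool:
    """Run apply_patch only when dynamic EPLB is enabled; report what happened."""
    if dynamic_eplb_enabled(env):
        apply_patch()
        return True
    return False


applied = []
print(maybe_apply_patch({"DYNAMIC_EPLB": "true"}, lambda: applied.append(1)))   # True
print(maybe_apply_patch({"DYNAMIC_EPLB": "false"}, lambda: applied.append(1)))  # False
```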
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
There is a lot of hack code for v0.11.0, which makes it hard to upgrade
to newer vLLM versions. Since v0.11.0 will be released soon, let's drop
v0.11.0 support first. We'll then upgrade to v0.11.2.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Drop VLLM_USE_V1 usage. This env has been removed from vLLM already.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fixes a compatibility bug with
`torch_npu.npu_fused_infer_attention_score`, which is described in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 suggested this solution.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
The environment is the same as in this issue,
https://github.com/vllm-project/vllm-ascend/issues/4020.
We modified the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918
and ran the code below:
```python
# run with Qwen3-next-mtp
from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]
# temperature=0.0 selects greedy decoding, so the output is deterministic.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Outputs:
```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```
Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, so the GIL
(Global Interpreter Lock) is never relinquished, which leads to CPU-bound
issues. This PR patches sched_yield in vLLM so that the process executes
time.sleep(0) instead, which releases the GIL.
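The replacement can be sketched as a small monkey-patch. This is an illustration only: `fake_utils` is a stand-in namespace created for the example, because the module path that the real patch targets in vLLM is deliberately not assumed here.

```python
import time
import types

# Stand-in for the module that owns sched_yield (hypothetical for this sketch).
fake_utils = types.SimpleNamespace(sched_yield=lambda: None)


def sched_yield_via_sleep() -> None:
    """Yield the CPU with time.sleep(0), which also releases the GIL."""
    time.sleep(0)


# Monkey-patch: callers that look up fake_utils.sched_yield now get the new one.
fake_utils.sched_yield = sched_yield_via_sleep
fake_utils.sched_yield()
print(fake_utils.sched_yield is sched_yield_via_sleep)  # True
```

A patch like this only affects callers that resolve `sched_yield` through the module attribute at call time; code that imported the function by name before the patch keeps the old reference.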
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
1. Add EPLB CI to guard against regressions in the EPLB feature.
2. Add validation of EPLB parameters.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Qwen in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
We noticed that `patch_main` is never used. Patches usually apply to all
versions, and when a patch is specific to one version, we can use
`vllm_version_is` instead. So let's remove the useless sub-folder in the
patch module to make the layout clearer.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
When dynamic EPLB is used, patch the v1 executor to avoid failures when
creating child processes.
### How was this patch tested?
deepseek in v3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch.
The final goal is to remove all the patches and align the code arch to
vllm, thus we need to do the following work in next prs.
TODO:
- [x] remove patch on attention spec
- [ ] refactor the kvcache creation logic
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
1. CI passed with existing test.
2. Test pass with deepseek-v3.2-exp
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Refactor the KV cache because the `page_size_bytes` patch is ineffective.
1. `AttentionSpec` is currently patched, but at runtime `page_size_bytes`
still uses the vLLM implementation, so the patch has no effect. This PR
therefore removes the patch on `AttentionSpec`; the final fix will be
made in vLLM.
2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce the
spec's `page_size_bytes`, so that `num_blocks` can double.
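The doubling follows from simple arithmetic. The sketch below assumes that a full-attention page stores separate K and V tensors (a factor of 2) while an MLA page stores a single latent cache; the function names and the example sizes are illustrative and are not taken from this PR.

```python
def full_attention_page_bytes(block_size: int, num_kv_heads: int,
                              head_size: int, dtype_bytes: int) -> int:
    # Separate K and V caches: factor of 2 (assumption stated above).
    return 2 * block_size * num_kv_heads * head_size * dtype_bytes


def mla_page_bytes(block_size: int, num_kv_heads: int,
                   head_size: int, dtype_bytes: int) -> int:
    # A single latent cache per token: no factor of 2.
    return block_size * num_kv_heads * head_size * dtype_bytes


# Illustrative sizes only.
full = full_attention_page_bytes(128, 1, 576, 2)
mla = mla_page_bytes(128, 1, 576, 2)
budget = 1000 * full  # A memory budget that fits exactly 1000 full pages.

print(full, mla)                       # 294912 147456
print(budget // full, budget // mla)   # 1000 2000
```

With the same memory budget, halving the page size doubles the number of blocks the spec reports.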
### How was this patch tested?
Test pass with Qwen3-Next and DeepSeek-V3.2-Exp
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Modify the scope in which the `_merge_multimodal_embeddings` patch is
enabled. The current patch is only enabled for offline inference on the
platform. For online serving, it is not enabled inside the worker
sub-process that online serving adds.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
1. Clean up v0.10.2 support in the UT and e2e tests.
2. Remove the v0.11.0 periodic job; we're at v0.11.0 now.
3. Remove the useless patches for DeepSeek V3.2. They have already been
done in vLLM.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR aims to address the incompatibility of the `.masked_scatter_`
operation in the current `_merge_multimodal_embeddings` function on
Ascend. For now, it reverts to the previous version of the CPU
operation, which can be executed asynchronously on the device side to
enhance performance.
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
This PR deletes ~2K lines of DeepSeek modeling code. It falls back from
the CustomDeepseekV2 modules to the original vLLM implementations and
adapts to some vLLM changes for DeepSeek and MoE.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with torchair graph mode and eager mode.
- vLLM version: v0.10.2
- vLLM main:
759ef49b15
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Really strange that `register_oot` doesn't work with `SharedFusedMoE`,
so we have to add this patch, for now.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
This PR won't have any effect in DeepSeek since we currently still stick
with the old `CustomDeepseekV2`.
- vLLM version: v0.10.1.1
- vLLM main:
0cdd213641
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. update `CachedRequestState` as `NewRequestData` changed in
https://github.com/vllm-project/vllm/pull/22570
2. drop maintenance of vllm v0.10.0 in the branch main
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
92ff41abea
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove the redundantly imported `envs` and use `envs_ascend` instead.
```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```
- vLLM version: v0.10.0
- vLLM main:
71683ca6f6
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed
- vLLM version: v0.9.2
- vLLM main:
7728dd77bb
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Remove the ETP/EP maintained in branch main. We drop this because there
are no relevant scenarios that use ETP now, and we may subsequently
advocate implementing expert tensor parallelism in vLLM to support
scenarios where experts need to be sliced.
This is a part of #1422 backport.
Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154
### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.
### How was this patch tested?
CI passed with newly added and existing tests.
- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a
Signed-off-by: MengqingCao <cmq0113@163.com>