25 Commits

Author SHA1 Message Date
yupeng
29f195a91c [Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)
### What this PR does / why we need it?
Fix the error that reports while initializing qwen3-reranker-0.6b model
with `--enable-lora`.
And add a testcase to verify the fix.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 17:55:42 +08:00
realliujiaxu
5e24b26a54 [Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883)
### What this PR does / why we need it?

PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/  \
    --port 8001 \
    --served-model-name auto \
    --max-model-len 1024 \
    --enforce-eager \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --additional-config '{"enable_shared_expert_dp": true}'
```

#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Hello. I have a question. Who are you?"}
  ],
  "max_tokens": 10,
  "temperature": 0.0,
  "ignore_eos_token": true
}'
```

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2026-03-01 20:22:50 +08:00
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632) 2026-02-27 08:27:41 +08:00
SILONG ZENG
19b5d44ea8 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10) (#6173)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/ops/layer_shard_linear.py`|
|`vllm_ascend/ops/linear.py`|
|`vllm_ascend/ops/linear_op.py`|
|`vllm_ascend/worker/worker.py`|
| ` vllm_ascend/patch/worker/patch_bert.py` |
| ` vllm_ascend/patch/worker/patch_deepseek.py` |
| ` vllm_ascend/patch/worker/patch_distributed.py` |
| ` vllm_ascend/patch/worker/patch_module.py` |
| ` vllm_ascend/patch/worker/patch_multimodal_merge.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next.py` |
| ` vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` |
| ` vllm_ascend/patch/worker/patch_rejection_sampler.py` |
| ` vllm_ascend/patch/worker/patch_rope.py` |
| ` vllm_ascend/patch/worker/patch_triton.py` |
| ` vllm_ascend/patch/worker/patch_unquantized_gemm.py` |
| ` vllm_ascend/patch/worker/patch_v2_egale.py` |
|` vllm_ascend/worker/npu_input_batch.py`|
|` vllm_ascend/worker/v2/aclgraph_utils.py`|
|` vllm_ascend/worker/v2/attn_utils.py`|
|` vllm_ascend/worker/v2/model_runner.py`|
|` vllm_ascend/worker/v2/sample/gumbel.py`|
|` vllm_ascend/worker/v2/sample/penalties.py`|
|` vllm_ascend/worker/v2/sample/sampler.py`|
|` vllm_ascend/worker/v2/spec_decode/__init__.py`|
|` vllm_ascend/worker/v2/spec_decode/eagle.py`|
|` vllm_ascend/worker/v2/states.py`|
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 15:35:06 +08:00
Wang Kunpeng
13cd6362c6 [bugfix] fix Error 'ValueError: Duplicate layer name' (#5280)
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-25 10:43:24 +08:00
Nengjun Ma
42c989a437 Update vllm pin to 12.24 (#5307)
### What this PR does / why we need it?
Fix vllm break in the pr:
1. [Add MiMo-V2-Flash support]
(https://github.com/vllm-project/vllm/pull/30836)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Co-authored-by: zxwang [1476209578@qq.com](mailto:1476209578@qq.com)

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
2025-12-24 17:24:31 +08:00
zzzzwwjj
cc23067f1e [refactor] refactor weight trans nz and transpose (#4878)
### What this PR does / why we need it?

Now `VLLM_ASCEND_ENABLE_NZ` will have three options:
0: disable nz;
1: only quant case enable nz;
2: enable nz as long as possible;

And `VLLM_ASCEND_ENABLE_NZ`=1 by default.

All cases are shown in the table below:

|  | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by
default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose
case | only support transpose case | linear: only support not transpose
case<br>gmm: only support transpose case | same to fp16/bf16 |

Some exceptional cases:
1. MLAPO op need to do some additional processing on the weights,
including trans nz. If use MLAPO op, some weight will be transformed to
nz forcely;
2. MLA/SFA's weight `W_UV` will be used by op
`torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support
nz currently;

### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weight will not trans nz by default.

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-19 14:27:24 +08:00
Slightwind
12ca99c94e [Bugfix] Remove ModelSlim-"M4 Quantization". (#4589)
The M4 quantization method in ModelSlim adds bias to model weights that
originally do not have a linear bias. PR #4235 supported PD-MIX
quantization and M4 quantization, adding bias to `w8a8.py` and
`w8a8_dynamic.py`, and implementing adaptations in `ops/linear.py` to
prevent it from being reset to `None` by
`self.register_parameter("bias", None)`. However, this modification
introduced an issue where the bias was still being reset to `None` in
certain scenarios, causing errors during service startup. Therefore,
support for M4 quantization is temporarily being reverted in this PR.
___
- vLLM version: v0.11.2

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-01 23:45:02 +08:00
zzhxxx
203b4e6777 [Bug_fix] fix torchair o_proj forward parameter (#4166)
### What this PR does / why we need it?
In `torchair_mla.py`, the `self.oproj` function includes an additional
parameter `is_force_scatter`, while the `AscendRowParallelLinear`
function in `linear.py` does not add this parameter.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2025-12-01 19:57:01 +08:00
Slightwind
18eefc23c3 [feature] Support W8A8 PD-Mix Quantization (#4235)
In PD-separated deployment scenarios:

* MoE layers use dynamic quantization exclusively.
* For the Attention module, Prefill (P) nodes use **dynamic**
quantization, while Decode (D) nodes use **static** quantization.

In PD-mixed deployment scenarios:
* **All components fall back to dynamic quantization**, as it is
difficult to distinguish between Prefill and Decode tokens.
___

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-30 11:57:26 +08:00
InSec
5a4e8cdeba [Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245)
### What this PR does / why we need it?
Support the Qwen3-Next-80B-A3B-Instruct quantization model and Fix the
NZ issue. Triton kernel doesn't support data format nz, thus we skip
converting weight to nz on layer `conv1d`

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: IncSec <1790766300@qq.com>
2025-11-21 10:42:56 +08:00
wangxiyuan
cc2cd42ad3 Upgrade CANN to 8.3.rc1 (#3945)
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.

TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-03 20:21:07 +08:00
rjg-lyh
c1a6aeab46 [main][bugfix] fix valueError in static_forward_context when prefix is empty (#3924)
### What this PR does / why we need it?
This PR temporarily bypasses the scenario where some models in vLLM
trigger a `ValueError` during the process of storing values in
`static_forward_context` when no `prefix` is specified for the linear
layers, which is a bug in some models in vLLM. The official fix will be
addressed by submitting a PR to the vLLM community that specifies a
prefix for the linear layers in each model.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-31 14:55:58 +08:00
rjg-lyh
292e213dd2 [main][refactor] refactor SequenceRowParallelOp forward (#3616)
### What this PR does / why we need it?
This PR refactors SequenceRowParallelOp forward. In order to further
expand the operator inclusion scope in dynamic judgment scenarios, this
PR customizes the entire matmul computation and communication as a
custom operator masking. With this refactor, it will support directly
writing code such as common operation fusion into the
`SequenceRowParallelOp` class's member function `matmul_and_reduce`,
without the need to register more redundant custom masking operators.

### How was this patch tested?
CI passed with existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-23 14:41:15 +08:00
linfeng-yuan
068ed706c8 [feat][torchair] support super kernel feat for quantized dsr1 (#3485)
### What this PR does / why we need it?
Port #1916 and #2157 to master branch to fuse operators in deepseek moe
layers, which can reduce scheduling overhead on devices. Note that this
feature is valid only when `tp_size = 1` and
`multistream_overlap_shared_expert` is enabled with torchair graph mode.

### Does this PR introduce _any_ user-facing change?
Users can enable this feature with `--additional-config
'{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true},
"multistream_overlap_shared_expert":true}'`.

### How was this patch tested?
E2E deepseek serving with 2P1D disaggregated prefill scenarios.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-20 20:04:37 +08:00
anon189Ty
07e39620ea [Feat] Unquantized Linear to nz and control all nz-cast (#3356)
### What this PR does / why we need it?
Currently, when executing to the Linear layer of models in vLLM-Ascend,
the weights format is ND in unquantized case and skipped ascend case.
This PR supplements the execution logic for Linear layer. We use a new
global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and
CANN version is 8.3, the weights of the Linear layer will be converted
to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also
use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as
w8a8-quantized case.

### Does this PR introduce _any_ user-facing change?
Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ
format, you should set VLLM_ASCEND_ENABLE_NZ=1.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-14 17:39:26 +08:00
weijinqian0
6972df5951 [Feature] optimize sp & qwen3 next support sp. (#3225)
This PR will accomplish the following tasks: 
**optimize SP**
In the old version implementation, the first layer was all_reduce, which
used rms to split chunks. We changed it to perform reduce_scatter on the
embedding side, replace one all_reduce operation and one chunk with one
reduce_scatter operation.
**Support qwen3 next**
Since Qwen3 Next includes a linear attention module, the prefix name of
this module cannot take effect directly.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-10-13 23:02:12 +08:00
realliujiaxu
af2a886814 refactor linear (#2867)
### What this PR does / why we need it?
The current linear.py has the following issues:

- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.

- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.

To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the classes selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add new selection
logic.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
dd39baf717

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Co-authored-by: weijinqian0 <weijinqian@huawei.com>
2025-09-18 14:09:19 +08:00
Yikun Jiang
756b8a1946 Revert "[Feat] Unquantized linear nz support (#2619)" (#2896)
### What this PR does / why we need it?
This reverts commit 7b2ecc1e9a.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: main
- vLLM main:
64d90c3e4f

Closes: https://github.com/vllm-project/vllm-ascend/issues/2890
Closes: https://github.com/vllm-project/vllm-ascend/issues/2887
Closes: https://github.com/vllm-project/vllm-ascend/issues/2885

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-12 20:51:12 +08:00
rjg-lyh
0005479b9c [main] mlp weight prefetch in Qwen Dense Models (#2816)
### What this PR does / why we need it?
This PR prefetchs the weight of mlp layers in Qwen Dense Models to
optimize the performance in Decode phase mainly.

### Does this PR introduce _any_ user-facing change?
 No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
a1213fae5f

Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Shuming19 <313093131@qq.com>
2025-09-11 21:20:09 +08:00
anon189Ty
7b2ecc1e9a [Feat] Unquantized linear nz support (#2619)
### What this PR does / why we need it?
Currently, when executing to the Linear layer of the model in
vLLM-Ascend, the weights input format is ND in unquantized case and
skipped ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for Linear layer. When
VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights
of the Linear layer will be converted to FRACTAL_NZ, in both unquantized
case and skipped ascend case.

- vLLM version: main
- vLLM main:
267c80d31f

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-09-11 11:40:00 +08:00
Li Wang
22b425765a [Bugfix] Fix broken CI (#2825)
### What this PR does / why we need it?
1. Initial support disable tp for integrating with
[vllm-commit](https://github.com/vllm-project/vllm/pull/23024)
2. [vllm@commit](https://github.com/vllm-project/vllm/pull/23673) now
use `bytes` to save the `BlockHash` to reduce GC overhead, this pr add
the integration

- vLLM version: main
- vLLM main:
e40827280b

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-10 13:29:29 +08:00
rjg-lyh
1bbb20ea13 [main] flashcomm_v1 optim in Qwen Dense Models (#2802)
### What this PR does / why we need it?
Flashcomm_v1 optim in Qwen Dense Models.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.1.1
- vLLM main:
5e537f45b4

Co-authored-by: 1024daniel <xxltju324@gmail.com>
2025-09-08 22:52:24 +08:00
lidenghui1110
5a7181569c [feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167)
### What this PR does / why we need it?
This PR introduces Oproj matrix tensor model parallel to achieve
decreasing of memory consumption. It only support graph mode in pure DP
scenario.

In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with
oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8
GB NPU memory per RANK. We got best performance when
oproj_tensor_parallel_size=4 without TPOT increasing.

performance data:
<img width="1442" height="442" alt="image"
src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d"
/>

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :---------------------------- |
:--------------------------------------- | :------- | :--- |
:----------------- |
| oproj_tensor_parallel_size | Split the o_proj matrix along the row
dimension (head num * head dim) into oproj_tensor_parallel_size pieces.
| No | int | default value is None, once this value is set, the feature
will be enabled, head num * head dim must be divisible by this value. |

example

`--additional_config={"oproj_tensor_parallel_size": 8}`

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
eddaafc1c7

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zzh <zzh_201018@outlook.com>
2025-09-07 10:31:32 +08:00
sherie
3fb80ee356 add mlp tp optimze (#2120)
### What this PR does / why we need it?
For dense models, by not applying tensor parallelism (TP) to the
attention module and applying TP to the MLP module, the allreduce
operations in the attention module can be eliminated, thereby reducing
computational overhead. However, this approach increases memory usage,
so the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMZE is used to
control this optimization.

- vLLM main:
b17109beea

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-08-21 09:22:07 +08:00