9 Commits

Author SHA1 Message Date
yupeng
830f39dd70 [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650)
### What this PR does / why we need it?
Fix the issue #6143 .

### Does this PR introduce _any_ user-facing change?
Allow to start the server with "--enable-lora && --fully-sharded-loras
&& --tensor_parallel_size 2".

### How was this patch tested?
pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-11 15:43:15 +08:00
SILONG ZENG
6ccccad102 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5) (#5996)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|
`.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py`
|
|
`vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py`
|
| `
.../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py`
|
| `
.../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py`
|
| `
.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py`
|
| `
.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py`
|
| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py`
|
| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` |
| `
vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` |
| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` |
| ` vllm_ascend/kv_offload/cpu_npu.py` |
| ` vllm_ascend/kv_offload/npu.py` |
| ` vllm_ascend/lora/lora_ops.py` |
| ` vllm_ascend/lora/punica_npu.py` |
| ` vllm_ascend/lora/utils.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-01-24 22:45:38 +08:00
ZT-AIA
e11ff8e535 [BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394)
### What this PR does / why we need it?
The customized ascend operator sgmv_expand and sgmv_shrink applies only
to the scenario where rank is 8,16,32,64. When rank >= 128, the operator
is out of range, causing the model to report an error.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408 
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2026-01-09 15:57:43 +08:00
hukongyi
ea8f544ce7 [BugFix]Fix precision issue for LoRA feature (#4141)
vLLM version: v0.11.0
vLLM main: vllm-project/vllm

### What this PR does / why we need it?
   Fix the precision issue of the LoRA feature in vllm-ascend.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
pytest tests/lora/test_llama_tp.py::test_llama_lora -s
```
<img width="1319" height="879" alt="lora_test"
src="https://github.com/user-attachments/assets/2a0b2325-5b05-4bbc-ac03-a7c9f0ad9d4c"
/>


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: hukongyi <hukongyi@cmbchina.com>
2025-12-19 14:22:06 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.

Based on the above points, we have made the following changes:

1. When user set env `SOC_VERSION`, use it; when not set, query
soc_version by `npu-smi`;
2. generate device_type based on soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.

### Does this PR introduce _any_ user-facing change?

When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default,
we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in
the list `soc_to_device` in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
a1f142b7ad Drop 0.11.0 support (#4377)
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-24 17:08:20 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is the step 1 of refactoring code to adapt with vllm main, and this
pr aligned with
17c540a993

1. refactor deepseek to the latest code arch as of
17c540a993
 
2. bunches of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import,caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` improt, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` improt, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured ouput break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
Zetong Li
a86ece5e39 [Bugfix][LoRA] Fix forward error and shape mismatch when using LoRA (#3153)
### What this PR does / why we need it?
Relying on #3044, this PR aims to further fix:
1. The forward error occured when `LogitsProcessorWithLoRA` calls
`AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA`
bypasses the MRO to call it, `super().forward(...)` in
`AscendLogitsProcessor.forward` will raise an error. This PR fixes it by
directly invoking `LogitsProcessor.forward(self, ...)`;
2. The shape mismatch in `add_lora_logits` in punica_npu.py. The
`lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1,
lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes
respectively, but they are misunderstood in #1583---the last two
dimensions were assumed in reverse order, which causes errors in
`bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting it to the
previous version to align with the implementation in punica_cpu.py in
vllm.

### Dependencies
This PR depends on changes introduced by #3044 (LoRA support for
`AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers).

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
The LoRA-related tests, e.g., test_ilama_lora.py and
test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as
`TransformersForCausalLM`, where `embedding_modules` attribute lacks
`lm_head`. However, `LlamaForCausalLM` and most other models include
both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute
contributes to `supported_lora_modules` when using LoRA in vllm.
Therefore, without `lm_head` in `embedding_modules`, current tests using
ilama-3.2-1B are unable to find the abve errors since
`LogitsProcessorWithLoRA` replacing `lm_head` is skipped. Simply using
Meta-Llama-3.1-8B-Instruct can reproduce the above errors and check
whether these fixes can work. What's more, it's necessary to add more
comprehensive tests for LoRA.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: Zetong Li <slippersss@126.com>
2025-09-28 17:30:50 +08:00
wangxiyuan
7d6d9449a8 [Misc] Move lora patch file into lora module (#2797)
Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:42:12 +08:00