xc-llm-ascend

Author	SHA1	Message	Date
yupeng	830f39dd70	[Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650 ) ### What this PR does / why we need it? Fix the issue #6143 . ### Does this PR introduce _any_ user-facing change? Allow to start the server with "--enable-lora && --fully-sharded-loras && --tensor_parallel_size 2". ### How was this patch tested? pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 15:43:15 +08:00
SILONG ZENG	6ccccad102	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #5 ) (#5996 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py` \| \| `vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py` \| \| ` .../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py` \| \| ` .../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py` \| \| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` \| \| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` \| \| ` vllm_ascend/kv_offload/cpu_npu.py` \| \| ` vllm_ascend/kv_offload/npu.py` \| \| ` vllm_ascend/lora/lora_ops.py` \| \| ` vllm_ascend/lora/punica_npu.py` \| \| ` vllm_ascend/lora/utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-01-24 22:45:38 +08:00
ZT-AIA	e11ff8e535	[BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394 ) ### What this PR does / why we need it? The customized ascend operator sgmv_expand and sgmv_shrink applies only to the scenario where rank is 8,16,32,64. When rank >= 128, the operator is out of range, causing the model to report an error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Depends on this commit https://github.com/vllm-project/vllm/pull/31408 - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: ZT-AIA <1028681969@qq.com> Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>	2026-01-09 15:57:43 +08:00
hukongyi	ea8f544ce7	[BugFix]Fix precision issue for LoRA feature (#4141 ) vLLM version: v0.11.0 vLLM main: vllm-project/vllm ### What this PR does / why we need it? Fix the precision issue of the LoRA feature in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```bash pytest tests/lora/test_llama_tp.py::test_llama_lora -s ``` <img width="1319" height="879" alt="lora_test" src="https://github.com/user-attachments/assets/2a0b2325-5b05-4bbc-ac03-a7c9f0ad9d4c" /> - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: hukongyi <hukongyi@cmbchina.com>	2025-12-19 14:22:06 +08:00
zzzzwwjj	136ea9ff56	[refact] unified soc_version code (#4359 ) ### What this PR does / why we need it? Currently, there are two paths to judge the chip type in code, `get_ascend_soc_version` use `get_soc_version` api in torch_npu, and `is_310p` `use _build_info.__soc_version__`, which generate when install. We need to unify the two paths. We need to unify these codes based on the following points: 1. We need to ensure consistency in chip type judgment between compiling and running states; 2. In compiling state, we need chip type to complete op's compilation, but in running state, we only need device type(910B/910_93/310P/910_95/etc) to make code branch judgement; 3. In compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's api. Based on the above points, we have made the following changes: 1. When user set env `SOC_VERSION`, use it; when not set, query soc_version by `npu-smi`; 2. generate device_type based on soc_version when compiling, and write `__device_type__` instead of `__soc_version__` in `_build_info.py`; 3. In running state, use `__device_type__` to judge code branch. ### Does this PR introduce _any_ user-facing change? When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default, we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in the list `soc_to_device` in `setup.py`. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-11-26 14:28:55 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot hack code for v0.11.0, which makes the code hard to upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is the step 1 of refactoring code to adapt with vllm main, and this pr aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. bunches of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import,caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` improt, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` improt, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured ouput break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
Zetong Li	a86ece5e39	[Bugfix][LoRA] Fix forward error and shape mismatch when using LoRA (#3153 ) ### What this PR does / why we need it? Relying on #3044, this PR aims to further fix: 1. The forward error occured when `LogitsProcessorWithLoRA` calls `AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA` bypasses the MRO to call it, `super().forward(...)` in `AscendLogitsProcessor.forward` will raise an error. This PR fixes it by directly invoking `LogitsProcessor.forward(self, ...)`; 2. The shape mismatch in `add_lora_logits` in punica_npu.py. The `lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1, lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes respectively, but they are misunderstood in #1583---the last two dimensions were assumed in reverse order, which causes errors in `bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting it to the previous version to align with the implementation in punica_cpu.py in vllm. ### Dependencies This PR depends on changes introduced by #3044 (LoRA support for `AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers). ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? The LoRA-related tests, e.g., test_ilama_lora.py and test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as `TransformersForCausalLM`, where `embedding_modules` attribute lacks `lm_head`. However, `LlamaForCausalLM` and most other models include both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute contributes to `supported_lora_modules` when using LoRA in vllm. Therefore, without `lm_head` in `embedding_modules`, current tests using ilama-3.2-1B are unable to find the abve errors since `LogitsProcessorWithLoRA` replacing `lm_head` is skipped. Simply using Meta-Llama-3.1-8B-Instruct can reproduce the above errors and check whether these fixes can work. What's more, it's necessary to add more comprehensive tests for LoRA. - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: Zetong Li <slippersss@126.com>	2025-09-28 17:30:50 +08:00
wangxiyuan	7d6d9449a8	[Misc] Move lora patch file into lora module (#2797 ) Cleanup useless file in patch module. Update the lora support list is OK in vLLM Ascend, no need to patch vLLM - vLLM version: v0.10.1.1 - vLLM main: `f4962a6d55` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-08 21:42:12 +08:00

9 Commits