### What this PR does / why we need it?
This PR integrates suffix decoding (https://arxiv.org/abs/2411.04975)
from vLLM (https://github.com/vllm-project/vllm/pull/25784).
Suffix Decoding is a dynamic n-gram matching method that:
1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, with FIFO
eviction of older requests.
### Does this PR introduce _any_ user-facing change?
This feature is opt-in and remains seamless for users who do not need
suffix speculative decoding.
To enable it, first install arctic-inference:
`pip install arctic-inference`
After installation, the suffix speculative decoding feature can be
enabled using the following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'`
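For offline inference, a minimal sketch (assuming the `LLM` entry point accepts the same `speculative_config` dict as the CLI flag; the model name is only an example, and arctic-inference must be installed):

```python
from vllm import LLM, SamplingParams

# Enable suffix speculative decoding via the same config dict as the CLI flag above.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # example model name
    speculative_config={"method": "suffix", "num_speculative_tokens": 5},
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```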
### How was this patch tested?
This PR is currently being tested on vLLM main (83f478bb19)
together with PR https://github.com/vllm-project/vllm/pull/25784.
In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.
- vLLM version: v0.11.2
---------
Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
### What this PR does / why we need it?
Replace the PyTorch implementation of sampling with Triton kernels.
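As an illustration only (not the kernels in this PR), a minimal sketch of what a sampling step looks like as a Triton kernel; it implements greedy (argmax) sampling, and the device and launch parameters are assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def greedy_sample_kernel(logits_ptr, out_ptr, vocab_size, stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)                      # one program per sequence in the batch
    offs = tl.arange(0, BLOCK)
    mask = offs < vocab_size
    logits = tl.load(logits_ptr + row * stride + offs, mask=mask, other=float("-inf"))
    tl.store(out_ptr + row, tl.argmax(logits, axis=0))  # pick the highest-logit token

batch, vocab = 4, 1000
logits = torch.randn(batch, vocab, device="cuda")   # would be an NPU tensor with triton-ascend
out = torch.empty(batch, dtype=torch.int32, device="cuda")
greedy_sample_kernel[(batch,)](logits, out, vocab, logits.stride(0),
                               BLOCK=triton.next_power_of_2(vocab))
```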
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.11.2
---------
Signed-off-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
The previous implementation of the flashcomm2 communication domain did
not take pp (pipeline parallelism) into account, which caused problems
when pp and flashcomm2 were enabled together. This PR fixes that issue.
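As a rough sketch of the idea (not flashcomm2's actual code, and the rank layout is an assumption), the communication domain should be built per pipeline stage rather than once across all ranks:

```python
def stage_rank_lists(tp_size: int, pp_size: int) -> list[list[int]]:
    # One communication domain per pipeline stage; in a real setup each list
    # would back a torch.distributed.new_group() call.
    return [list(range(s * tp_size, (s + 1) * tp_size)) for s in range(pp_size)]

print(stage_rank_lists(4, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```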
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
### What this PR does / why we need it?
Previously, the dummy run executed compute_logits only once, regardless
of num_speculative_tokens. This caused execute_model to hang on
compute_logits when LM head tensor parallelism exceeded 1. The fix
ensures compute_logits executes the correct number of times during the
dummy run, matching num_speculative_tokens.
I set the `non_blocking` argument to False when moving
`exceeds_max_model_len` to the CPU. From what I understand, using
`non_blocking=True` and immediately accessing the tensor on the CPU can
cause accuracy problems. However, this issue doesn't happen when
transferring data to a device. ref:
https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18
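A small illustration of the pitfall (run on the CPU here purely for demonstration; on hardware `exceeds_max_model_len` lives on the NPU):

```python
import torch

# Stand-in for the device tensor; on hardware this would be an NPU tensor.
exceeds_max_model_len = torch.zeros(8, dtype=torch.bool)

# With non_blocking=True a device-to-host copy may still be in flight when the
# CPU tensor is read, so stale values can be observed. The default blocking copy
# guarantees the data is ready before .any() runs.
exceeds_cpu = exceeds_max_model_len.to("cpu", non_blocking=False)
if exceeds_cpu.any():
    print("some requests exceed max_model_len")
```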
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
### What this PR does / why we need it?
Fix the Qwen2.5_Omni model image/video inference issue.
The Qwen2.5_Omni vision tower uses AscendRMSNorm, which contains a
property function that would be overridden by set_forward_context().
Patch Qwen2_5OmniThinkerForConditionalGeneration with customized
_process_image_input() and _process_video_input() to fix it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: Ting FU <futing10@huawei.com>
### What this PR does / why we need it?
Add a new fused op to custom_op, which combines torch.bmm() and
transpose to achieve better performance. This op is used in mla_v1 to
replace the separate bmm and transpose.
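For reference, a plain-PyTorch sketch of the pattern being fused (the fused op's exact signature and which dimensions are swapped are assumptions, not the real custom op):

```python
import torch

def bmm_then_transpose(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # torch.bmm produces (B, M, N); the follow-up transpose re-lays it out as (M, B, N).
    # The custom op fuses the two steps so the transposed intermediate is never materialized separately.
    return torch.bmm(a, b).transpose(0, 1).contiguous()

a = torch.randn(8, 4, 16)
b = torch.randn(8, 16, 32)
print(bmm_then_transpose(a, b).shape)  # torch.Size([4, 8, 32])
```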
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.2
---------
Signed-off-by: hust17yixuan <303660421@qq.com>
In PD-separated deployment scenarios:
* MoE layers use dynamic quantization exclusively.
* For the Attention module, Prefill (P) nodes use **dynamic**
quantization, while Decode (D) nodes use **static** quantization.
In PD-mixed deployment scenarios:
* **All components fall back to dynamic quantization**, as it is
difficult to distinguish between Prefill and Decode tokens.
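The policy above, restated as a small sketch (the role/flag names are illustrative, not actual vllm-ascend configuration fields):

```python
def attention_quant_mode(is_pd_separated: bool, is_prefill_node: bool) -> str:
    if not is_pd_separated:
        # PD-mixed: prefill and decode tokens are interleaved, so fall back to dynamic
        return "dynamic"
    return "dynamic" if is_prefill_node else "static"

def moe_quant_mode() -> str:
    # MoE layers use dynamic quantization in all deployment modes
    return "dynamic"

print(attention_quant_mode(is_pd_separated=True, is_prefill_node=False))  # static
```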
___
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Fix the MoE model accuracy problem caused by the version upgrade.
Reason:
The new version adds a "reduce_output" operation after "forward_impl",
while we have fully taken over the implementation of the FusedMoE module.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
The Ascend scheduler was added earlier for the non-chunked-prefill case,
since the NPU ops didn't work well with chunked prefill.
Now that the ops work better with chunked prefill, it's time to remove
the Ascend scheduler and use the vLLM default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The expert mapping table and weights of the dynamic EPLB were not
updated, so accuracy remained correct but the rebalancing did not take
effect. This bug has now been fixed.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Fix a hang in _npu_flash_attention during _forward_prefill_no_cache,
which was caused by a wrong attention mask dtype.
### How was this patch tested?
Yes, tested on Qwen2.5-VL and Qwen2.5-Omni.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: Ting FU <futing10@huawei.com>
### What this PR does / why we need it?
Support the Triton chunk_gated_delta_rule op for qwen3-next.
### co-owners
@OsirisDuan
- vLLM version: v0.11.2
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
1. In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage-engine interaction logic
into a new Backend class.
Associated RFC: https://github.com/vllm-project/vllm-ascend/issues/4329
2. Fixed an issue, introduced in vLLM 0.11.2, where the connector
received the wrong number of input parameters.
### Does this PR introduce _any_ user-facing change?
Yes, MooncakeStoreConnector is renamed to AscendStoreConnector.
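Illustratively (field names follow vLLM's KVTransferConfig; any additional fields in your deployment are unaffected), a kv_transfer_config that referenced the old name would change like this:

```python
old_cfg = {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}
new_cfg = {"kv_connector": "AscendStoreConnector", "kv_role": "kv_both"}
```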
### How was this patch tested?
- vLLM version: v0.11.2
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.
Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`
This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).
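As a CPU-only sketch of how an operator surfaces under a torch.ops namespace, analogous to the `torch.ops._C_ascend` binding above (the namespace, op name, and body here are illustrative, not the aclnn op shipped in this PR):

```python
import torch

torch.library.define("demo_ascend::swiglu", "(Tensor x) -> Tensor")

@torch.library.impl("demo_ascend::swiglu", "cpu")
def _swiglu_cpu(x: torch.Tensor) -> torch.Tensor:
    # reference SwiGLU: split the last dim into gate/up halves and combine them
    gate, up = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * up

out = torch.ops.demo_ascend.swiglu(torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 4])
```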
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.11.2
---------
Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
### What this PR does / why we need it?
Delete equals sign in doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
### What this PR does / why we need it?
Add readme for PD separation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to the CPU (compatible with NPU FA); see
the sketch after this list.
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.
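A small sketch of the `cu_seqlens` conversion from the checklist above, reading "intervals" as per-sequence lengths (the tensor values are illustrative):

```python
import torch

cu_seqlens = torch.tensor([0, 3, 7, 12])              # cumulative form produced upstream
seq_lens = (cu_seqlens[1:] - cu_seqlens[:-1]).cpu()   # per-sequence lengths on the CPU
print(seq_lens.tolist())  # [3, 4, 5]
```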
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark
- vLLM version: v0.11.2
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
To load quantized weights generated with the vLLM community's LLM
Compressor tool, the vLLM Ascend engine needs to be adapted to support
the compressed-tensors quantization format.
1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weights.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
3. Support CompressedTensorsW8A8Dynamic weights.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic (see the sketch after this list).
4. Modify the override_quantization_method in AscendQuantConfig.
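As a plain-PyTorch sketch of the W8A8 schemes listed above (reference semantics only, not the Ascend kernels; scale handling such as zero-guarding is omitted):

```python
import torch

def quant_weight_per_channel(w: torch.Tensor):
    # per-channel, symmetric int8: one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8), scale

def quant_act_per_token_dynamic(x: torch.Tensor):
    # per-token, symmetric int8: scales computed on the fly from each token's activations
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale

wq, w_scale = quant_weight_per_channel(torch.randn(64, 128))
xq, x_scale = quant_act_per_token_dynamic(torch.randn(8, 128))
print(wq.dtype, w_scale.shape, xq.dtype, x_scale.shape)
```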
Co-authored-by: taoqun110 <taoqun@huawei.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
- vLLM version: v0.11.2
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
### What this PR does / why we need it?
Upgrade CANN to 8.3.RC2.
### Does this PR introduce _any_ user-facing change?
Yes, the Docker image will use CANN 8.3.RC2.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.
**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.
---
- vLLM version: v0.11.2
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This patch mainly does the following:
1. Makes k8s/lws optional for multi-node testing, allowing developers to
run multi-node tests locally by explicitly passing in the IP addresses of
all nodes.
2. Allows passing a custom proxy script path in the config file to load
the proxy.
- vLLM version: v0.11.2
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
| | cpu environment | npu environment |
|---|---|---|
| `SOC_VERSION` set | Check whether `SOC_VERSION` is in the `soc_to_device` dict; if not, raise an error that the current chip type is not supported. | Print a warning log when `SOC_VERSION` does not match the chip type reported by `npu-smi`; otherwise behave the same as the cpu case. |
| `SOC_VERSION` not set | Raise an error stating that `SOC_VERSION` is required when compiling in a cpu environment. | Use the chip type from `npu-smi` to compile vllm-ascend. |
### Does this PR introduce _any_ user-facing change?
Yes. The env `SOC_VERSION` must now be set when compiling in a cpu environment.
### How was this patch tested?
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
When running 'python example.py', connection issues often occur. The
solution is to comment out the first line of the code.
Complete the specific names of the A2 and A3 machines.
Standardize the document format: a space should be added after the colon.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.2
---------
Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
### What this PR does / why we need it?
Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.
We need to unify these codes based on the following points:
1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.
Based on the above points, we have made the following changes (see the
sketch after this list):
1. When the user sets the env `SOC_VERSION`, use it; when not set, query
the soc_version via `npu-smi`;
2. Generate the device_type from the soc_version at compile time, and
write `__device_type__` instead of `__soc_version__` into `_build_info.py`;
3. At runtime, use `__device_type__` to select code branches.
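A minimal sketch of that flow, assuming an illustrative subset of the `soc_to_device` mapping and omitting the `npu-smi` query path:

```python
import os

soc_to_device = {"ASCEND910B1": "910B", "ASCEND310P3": "310P"}  # illustrative subset

def resolve_device_type() -> str:
    soc_version = os.environ.get("SOC_VERSION")
    if soc_version is None:
        # the real build falls back to querying the chip type via `npu-smi` here
        raise RuntimeError("SOC_VERSION not set; npu-smi fallback omitted in this sketch")
    if soc_version not in soc_to_device:
        raise ValueError(f"Unsupported SOC_VERSION: {soc_version}")
    return soc_to_device[soc_version]

# the result is written as __device_type__ into _build_info.py and read at runtime
```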
### Does this PR introduce _any_ user-facing change?
When the env `SOC_VERSION` is not set, it no longer defaults to
`ASCEND910B1`; instead, the soc_version is queried via `npu-smi`. Also,
`SOC_VERSION` must be one of the keys of `soc_to_device` in `setup.py`.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
1. Run the 4-card test only when the single-card and 2-card tests have passed.
2. Rename files to make them clearer.
3. Remove the useless PD workflow; it is already covered by the nightly
test.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Create a triton package directory and move the Triton files into it.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
Delete useless comments.
### Does this PR introduce _any_ user-facing change?
No
- vLLM main:
2918c1b49c
Signed-off-by: GDzhu01 <809721801@qq.com>
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>