xc-llm-ascend

Author	SHA1	Message	Date
whx	98cadc2146	[Perf] Avoid performing index selection of sin/cos cache every layer (#1890 ) Optimize number of index selections of sin/cos cache. - vLLM version: v0.10.0 - vLLM main: `656c24f1b5` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 18:06:45 +08:00
wangxiyuan	0190b68f51	[Misc]Remove PD v0 code (#2047 ) Cleanup V0 disaggregated prefill code for V0 Engine. part of https://github.com/vllm-project/vllm-ascend/issues/1620 TODO: enable v1 e2e test. - vLLM version: v0.10.0 - vLLM main: `2cc571199b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 19:09:22 +08:00
huangxialu	1a25b0a2dd	[Test] add ut for qwen3_moe.py (#2055 ) ### What this PR does / why we need it? Add ut for qwen3_moe.py ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-28 17:37:13 +08:00
LeeWenquan	3ad582c9a9	[Test] Add ut for files in /attention (#1944 ) ### What this PR does / why we need it? Add ut for files in folder /attention ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `139a7f07bd` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-28 15:54:40 +08:00
Ronald1995	32a9c5f694	[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926 ) ### What this PR does / why we need it? The vLLM `RowParallelLinear` forward function executes matmul and allreduce separately. This PR uses `torch_npu.npu_mm_all_reduce_base` to execute both in a single fused kernel, which gives a 20% performance improvement in eager mode. ### Does this PR introduce _any_ user-facing change? This PR introduces a new environment variable, `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE`, that controls whether the feature is enabled. ### How was this patch tested? The patch is covered by a new unit test file, `test_patch_linear.py`. - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-28 15:13:37 +08:00
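For context on #1926: in a row-parallel linear layer, each tensor-parallel rank multiplies its shard of the input by its shard of the weight, and an all-reduce then sums the partial results. The pure-Python sketch below only illustrates that arithmetic. The helper names `matmul` and `row_parallel_linear` are hypothetical, the ranks are simulated in a loop, and nothing here calls `torch_npu`; a fused kernel is expected to produce the same values in one call.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows (illustrative only)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def row_parallel_linear(x, w, tp_size):
    """Simulate a row-parallel linear layer across tp_size ranks.

    Each simulated rank holds a slice of the input columns and the
    matching slice of the weight rows; the final sum stands in for
    the all-reduce step.
    """
    inner = len(w)
    assert inner % tp_size == 0, "inner dimension must split evenly"
    shard = inner // tp_size

    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]
        w_shard = w[lo:hi]
        partials.append(matmul(x_shard, w_shard))

    # "All-reduce": element-wise sum of every rank's partial output.
    rows, cols = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


x = [[1, 2, 3, 4]]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(row_parallel_linear(x, w, tp_size=2) == matmul(x, w))
```

Integer inputs are used so that the sharded and unsharded results compare exactly; with floating-point data the two paths can differ by rounding.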
zzzzwwjj	ba3dfbd59e	[main][refactor] Refactoring forward_context and model_runner_v1 (#1979 ) ### What this PR does / why we need it? Refactor forward_context and model_runner_v1: add the context needed for model inference to forward_context, and rework the dummy_run logic to make it more reasonable. Some details for this PR: Add `ascend_forward_context`; Update mc2_v2 op, and support `active_mask` param; Update scripts in examples dir; refactor `dummy_run` logic; Add soc_version for A2 and A3; ### Does this PR introduce _any_ user-facing change? No user-facing change. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `57c22e57f9` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-07-28 14:06:20 +08:00
zhangxinyuehfad	d1c640841b	[Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803 ) ### What this PR does / why we need it? Fix the `num_hidden_layers` lookup for Qwen2-Audio 7B, and fix #1760:
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
  File "/workspace/test1.py", line 58, in <module>
    main(audio_count)
  File "/workspace/test1.py", line 38, in main
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
  File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
    config = VllmConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
    current_platform.check_and_update_config(self)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
    update_aclgraph_sizes(vllm_config)
  File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
    num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```
### Does this PR introduce _any_ user-facing change? ### How was this patch tested? 
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780 https://github.com/vllm-project/vllm-ascend/issues/1760 https://github.com/vllm-project/vllm-ascend/issues/1276 https://github.com/vllm-project/vllm-ascend/issues/359 - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-26 20:13:00 +08:00
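The traceback in #1803 comes from reading `num_hidden_layers` directly off a composite config that does not expose it at the top level. One defensive pattern, shown here purely as an illustration and not necessarily what the PR implements, is to fall back to a nested text config. The helper `get_num_hidden_layers` is hypothetical, the layer count 32 is an arbitrary placeholder, and `SimpleNamespace` objects stand in for real Hugging Face config classes.

```python
from types import SimpleNamespace


def get_num_hidden_layers(hf_config):
    """Hypothetical helper: read num_hidden_layers from a flat or nested config."""
    # Flat configs (most text-only models) expose the attribute directly.
    if hasattr(hf_config, "num_hidden_layers"):
        return hf_config.num_hidden_layers

    # Composite configs may keep it on a nested text config instead.
    text_config = getattr(hf_config, "text_config", None)
    if text_config is not None and hasattr(text_config, "num_hidden_layers"):
        return text_config.num_hidden_layers

    raise AttributeError(
        f"{type(hf_config).__name__} has no num_hidden_layers, "
        "directly or under text_config"
    )


# Stand-in objects; the value 32 is a placeholder, not a real model setting.
flat = SimpleNamespace(num_hidden_layers=32)
nested = SimpleNamespace(text_config=SimpleNamespace(num_hidden_layers=32))

print(get_num_hidden_layers(flat), get_num_hidden_layers(nested))
```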
Pleaplusone	df0ec55162	Disaggregate prefill for kv cache register style (#950 ) ### What this PR does / why we need it? This PR adopts `LLMDataDist` for kv cache registration and a `pull_blocks` style disaggregated prefill implementation. The interface implementation mainly follows the design of NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 . This PR can be tested with the following steps: - Generate the rank table for all machines. - Execute `toy_proxy.py` to launch the disaggregated prefill proxy server, specifying the prefill IP and port and the decode IP and port. - Run the prefill server and decode server. - Send the request to the disaggregated prefill proxy. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Signed-off-by: liziyu179 <3475441767@qq.com> Signed-off-by: underfitc <hucong24@huawei.com> Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Co-authored-by: liziyu179 <3475441767@qq.com> Co-authored-by: underfitc <hucong24@huawei.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-07-26 17:15:47 +08:00
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
Ronald1995	e561a2c6ec	ut:add ut for qwen2_5_vl_without_padding.py (#1988 ) ### What this PR does / why we need it? this pr is to add ut for qwen2_5_vl_without_padding.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? this is only a ut test - vLLM version: v0.9.2 - vLLM main: `9c8b2c2a8a` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-25 14:12:44 +08:00
SunnyLee151064	ae560f7131	[Test] Add uts for files in /core (#1957 ) ### What this PR does / why we need it? Add uts for files in folder /core ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `5a19a6c670` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-25 09:48:19 +08:00
SunnyLee151064	ab7d5aca5d	[Test] Add ut for files in /multistream (#1947 ) ### What this PR does / why we need it? Add some uts for files in folder /multistream ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `b77c7d327f` Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-24 10:42:49 +08:00
SunnyLee151064	34571ea5ae	[Test] Add ut for files in /distributed (#1951 ) ### What this PR does / why we need it? Add some ut for files in folder /distributed ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-24 10:36:11 +08:00
Zac	2ffe051859	[Test]add ut for deepseek_v2. (#1964 ) What this PR does / why we need it? Add uts for deepseek_v2 Does this PR introduce any user-facing change? No How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `f3137cdd81` --------- Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>	2025-07-24 10:27:50 +08:00
wangxiyuan	846555cdb5	[Misc] Clean up useless code in attention (#1933 ) Before refactoring the attention module, we can do some code cleanup to make the next step easier. What this PR does: 1. remove useless `common_prefix_len` for attention builder 2. remove useless `is_only_prefill` and `num_input_tokens` in attention metadata. 3. remove `CommonAttentionMetadata` and use `query_start_loc` instead; `CommonAttentionMetadata` is over-designed and useless 4. update the attention backend input parameters to keep the same as vLLM. 5. Rename attention name to the same style with `ASCEND` prefix - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-24 10:23:34 +08:00
shiyuan680	ac0bf133f4	add ut of fused_moe.py (#1930 ) ### What this PR does / why we need it? add unit test for fused_moe.py - vLLM version: v0.9.2 - vLLM main: `2dec7c1a5d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng <yangcheng104@huawei.com>	2025-07-23 16:24:09 +08:00
weichen	ac773aca43	Add UT for Patches (#1766 ) ### What this PR does / why we need it? Add UT for patches in vLLM Ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Irrelevant - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-23 16:07:20 +08:00
Mengqing Cao	3aa3b46bfe	[V1][PP] Support pp with ray backend in V1 (#1800 ) ### What this PR does / why we need it? Support pipeline parallel with ray backend in V1Engine. Fixes #1751 ### Does this PR introduce _any_ user-facing change? Users could specify ray as distributed backend when inferencing with pp ### How was this patch tested? CI passed with new added test. - vLLM version: v0.9.2 - vLLM main: `32142b3c62` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-23 14:52:52 +08:00
JohnJan	ce4970eee0	[Test] Add unit test for schedule_config.py (#1590 ) What this PR does / why we need it? According to issue https://github.com/vllm-project/vllm-ascend/issues/1298 , this pull request adds unit test code for schedule_config.py. Does this PR introduce any user-facing change? No How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2`	2025-07-22 11:43:25 +08:00
Yikun Jiang	5f0b42e414	[FOLLOWUP] Use base test to avoid patch everywhere (#1634 ) ### What this PR does / why we need it? Use base test to avoid patch everywhere. Followup here: https://github.com/vllm-project/vllm-ascend/pull/1566 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut ci passed - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-22 09:03:40 +08:00
wangxiyuan	7265dc090d	[2/4][Refactor] Refactor torchair utils (#1892 ) There is a lot of torchair-specific logic in common code, which makes the code hard to maintain. We will create a new torchair module and move torchair-related logic there. I plan to add 4 PRs. 1. Refactor worker 2. Refactor utils (this PR) - simple change that moves all torchair-related util functions to the torchair module 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 19:43:30 +08:00
wangxiyuan	af56ae3ed1	[1/4][Refactor] Refactor torchair worker (#1885 ) There is a lot of torchair-specific logic in common code, which makes the code hard to maintain. We will create a new torchair module and move torchair-related logic there. I plan to add 4 PRs. 1. Refactor worker (this PR) - create the torchair module and move torchair-related code in worker to the new module 2. Refactor utils 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 11:50:46 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this because there are no relevant scenarios that use ETP now, and we may subsequently advocate implementing expert tensor parallelism in vLLM to support scenarios where the experts need to be sliced. This is a part of #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in vllm instead. ### How was this patch tested? CI passed with new added and existing test. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
wangxiyuan	a8b316ac5b	[CI] Make AttentionBackend interface compatible to fix broken CI (#1893 ) vLLM commit `752c6ade2e` removed `blocksparse_params` for attention backend. This PR does the same change to make CI happy. - vLLM version: v0.9.2 - vLLM main: `9499e26e2a` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-21 08:21:06 +08:00
lianyibo	53d2ea3789	[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811 ) ### What this PR does / why we need it? maybe fixes [#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test Qwen3-32B tp=4 with:
```bash
vllm serve --port 1234 Qwen/Qwen3-32B \
    --served-model-name Qwen3-32B \
    --tensor-parallel-size 4 \
    --swap-space 16 \
    --max-model-len 6000 \
    --load-format dummy \
    --disable-log-stats \
    --disable-log-requests
```
Request batch_size=128 input/output token=1024

In 0.9.2rc1
```text
=====================================================
Total TPS with prefill(tokens/s) : 785.1395
Total TPS without prefill : 846.6809
Mean TPS with prefill : 6.1339
Mean TPS without prefill : 6.6147
=====================================================
Mean TTFT(ms) : 10307.8123
Max TTFT(ms) : 21423.0733
Min TTFT(ms) : 362.3602
=====================================================
Mean TPOT(ms) : 151.3051
Max TPOT(ms) : 159.4649
Min TPOT(ms) : 140.899
=====================================================
Total Time(s) : 175.6032
Request Throughput(requests/s) : 0.7289
=====================================================
```
Apply this PR
```text
=====================================================
Total TPS with prefill(tokens/s) : 811.0014
Total TPS without prefill : 876.4423
Mean TPS with prefill : 6.3359
Mean TPS without prefill : 6.8472
=====================================================
Mean TTFT(ms) : 10263.8382
Max TTFT(ms) : 21151.2547
Min TTFT(ms) : 375.9136
=====================================================
Mean TPOT(ms) : 146.1686
Max TPOT(ms) : 154.0957
Min TPOT(ms) : 136.8879
=====================================================
Total Time(s) : 169.8579
Request Throughput(requests/s) : 0.7536
=====================================================
```
The TPOT performance gap between these two sets of data is about 3%. 
- vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: lianyibo <lianyibo1@kunlunit.com>	2025-07-18 23:09:54 +08:00
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841 ) ### What this PR does / why we need it? We'll refactor `CustomOp` in vllm-ascend from this PR on. Use the function `CustomOp.register_oot` to register the custom op, taking `AscendQuickGELU` as an example:
```python
from vllm.model_executor.custom_op import CustomOp

from vllm_ascend.ops.activation import AscendQuickGELU

# Register the Ascend implementation under the upstream op name.
CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU")
```
This is a quick adaptation to the `CustomOp.register_oot` mechanism from vllm 0.9.2. As a further step, we can stop inheriting from `QuickGELU` and write our own `QuickGELU` entirely. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
xudongLi-cmss	33ef5dc813	add unit test for func wrapper (#1863 ) ### What this PR does / why we need it? test func wrapper file ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added test. - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>	2025-07-18 11:05:17 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00
wangxiyuan	787010a637	[Test] Remove VLLM_USE_V1 in example and tests (#1733 ) V1 is enabled by default, so there is no need to set it by hand now. This PR removes the useless setting in examples and tests - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 12:49:57 +08:00
wangxiyuan	7bdada58eb	[Misc] Remove VLLM_USE_V1 usage in code (#1764 ) We plan to remove V0 code from this version. The first step is to delete v0 usage. Related: https://github.com/vllm-project/vllm-ascend/issues/1620 - vLLM version: v0.9.2 - vLLM main: `61e20828da` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:16 +08:00
Pr0Wh1teGivee	d13fb0766e	[Perf] add patch to optimize apply_topk_topp (#1732 ) ### What this PR does / why we need it? Performance optimization for apply_top_k_top_p ### Does this PR introduce _any_ user-facing change? Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature ### How was this patch tested? e2e & ut - vLLM version: v0.9.2 - vLLM main: `6a9e6b2abf` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-11 15:32:02 +08:00
ApsarasX	643e6f5486	[Bugfix] Fix accuracy problem caused by mask pollution (#1678 ) ### What this PR does / why we need it? If a small batch of short requests is sent first, forming a chunk with a length <128, it will corrupt the `attn_mask_cache`, causing subsequent requests that do not form a chunk to have accuracy issues. The root cause of this problem is the use of in-place multiplication. Modifying it to use out-of-place multiplication will resolve the accuracy problem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Yes. - vLLM version: v0.9.2 - vLLM main: `ad6c2e1a0b` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-10 14:06:49 +08:00
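The failure mode described in #1678 can be reproduced in miniature without any tensor library. The sketch below is an analogy only: an `array` plays the role of the cached mask, a `memoryview` slice plays the role of a tensor view (basic slicing of a PyTorch tensor also returns a view), and the scale factor is an arbitrary illustrative value. The function names are hypothetical.

```python
from array import array


def scale_in_place(view, factor):
    """Buggy pattern: writes through the view, so the shared cache changes."""
    for i in range(len(view)):
        view[i] *= factor
    return view


def scale_out_of_place(view, factor):
    """Fixed pattern: builds a new array and leaves the cache untouched."""
    return array("d", (value * factor for value in view))


# In-place scaling of a slice pollutes the cached values.
polluted_cache = array("d", [1.0, 1.0, 1.0, 1.0])
scale_in_place(memoryview(polluted_cache)[:2], -10000.0)
print(list(polluted_cache))

# Out-of-place scaling returns a fresh result; the cache stays intact.
clean_cache = array("d", [1.0, 1.0, 1.0, 1.0])
scaled = scale_out_of_place(memoryview(clean_cache)[:2], -10000.0)
print(list(clean_cache), list(scaled))
```

Once the cache has been modified through the view, every later request that reads the same entries sees the scaled values, which matches the description of later requests losing accuracy.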
wangxiyuan	392fd7239b	[Misc] Add attention mask (#1673 ) Move the attention mask from V0 to a common place. - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 09:12:03 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vllm has released 0.9.2. This PR drops 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
NeverRaR	71de52d3a9	feat: add kv cache memory cache and skip dynamo guard (#1549 ) ### What this PR does / why we need it? 1. Loading the torchair cache sometimes fails because NPU memory fluctuates, so this PR adds a new cache that saves the old kv cache bytes to avoid a possible crash while loading the torchair graph cache. 2. When caching is enabled but no cache exists yet, the first compilation introduces the overhead of Dynamo guards. In this case, we compile twice directly to skip them (this brings a 3-4 ms TPOT improvement). ### Does this PR introduce _any_ user-facing change? Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to control kv cache floating tolerance ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:37:14 +08:00
wangyanhui-cmss	4e29c5a808	Add ut for test_pooling_model_runner.py (#1640 ) ### What this PR does / why we need it? Add ut for test_pooling_model_runner.py ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? python -m unittest test_pooling_model_runner.py - vLLM version: v0.9.1 - vLLM main: `2e610deb72` --------- Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>	2025-07-07 17:12:11 +08:00
Yikun Jiang	0c1d239df4	Add unit test local cpu guide and enable base testcase (#1566 ) ### What this PR does / why we need it? Use Base test and clean up all manual patch code - Clean up EPLB config to avoid tmp test file - Use BaseTest with global cache - Add license - Add a doc to set up unit tests in a local env ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-06 10:42:27 +08:00
wangxiyuan	343955c7ac	[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625 ) Commit `78fe77534b` from vllm reverted the change for FusedMoEParallelConfig. This PR does the same to fix the CI error. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-04 17:54:33 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. In addition, we add a supported-model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Angazenn	9fbd8017c0	[Quantization]300I Duo support w8a8 quantization (#1560 ) ### What this PR does / why we need it? This pr supports w8a8 on 300I Duo platform. The main change is to use `npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? offline inference on 310p runs normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:12:46 +08:00
wangxiyuan	a45dfde283	[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602 ) Make CI happy 1. `c1909e7e8c` changed how moeConfig is initialized 2. `48fb076cbc` changed input batch logic. This PR applies these changes to vllm-ascend. Closes: https://github.com/vllm-project/vllm-ascend/issues/1600 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-03 18:36:17 +08:00
zhanghw0354	9fb3d558e5	[Test]Add unit test for platform.py (#1476 ) ### What this PR does / why we need it? According to issue #1298 , this pull request adds unit test code for platform.py. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. --------- Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: zhuyilin <809721801@qq.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>	2025-07-02 17:46:06 +08:00
Zhu Yi Lin	6b80c5acba	Fix W8A8 fused moe bug (#1529 ) ### What this PR does / why we need it? 1. drop some useless code for w8a8 fusedmoe 2. Add int8 kv cache check 3. Add more ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: zhuyilin <809721801@qq.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-02 16:40:51 +08:00
Agonixiaoxiao	7fc1a98489	add ut for kv transfer module (#1531 ) ### What this PR does / why we need it? Test kv data transfer, covering connect, pipe, and buffer ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: lixudong <lixudong@cmss.chinamobile.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: lixudong <lixudong@cmss.chinamobile.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-07-02 16:14:52 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Change as little existing code as possible to add support for the v1 pooling task. Note that I moved `vllm.v1.worker.gpu_input_batch` down into vllm-ascend to decouple from the frequent changes in upstream interfaces. ### How was this patch tested? CI passed with new added/existing tests. A simple test, adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, was first run locally, as shown below:
```python
import os

import torch
from vllm import LLM

os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
--------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
wangxiyuan	5968dff4e0	[Build] Add build info (#1386 ) Add static build_info py file to show soc and sleep mode info. It helps to make the code clean and the error info will be more friendly for users This PR also added the unit test for vllm_ascend/utils.py This PR also added the base test class for all ut in tests/ut/base.py Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-27 09:14:43 +08:00
wangyanhui-cmss	e5eea64b66	[CI/UT] Add ut for parallel_state.py (#1460 ) ### What this PR does / why we need it? Add ut for parallel_state.py ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? python -m unittest test_parallel_state.py --------- Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>	2025-06-26 19:03:27 +08:00
Pr0Wh1teGivee	2fda60464c	[Perf] Use fused ops npu_top_k_top_p (#1308 ) ### What this PR does / why we need it? Use the fused op `torch_npu.npu_top_k_top_p(logits, p, k)` when p and k are not None; otherwise fall back to the original implementation. The replacement takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`. This patch uses `npu_top_k_top_p`, which requires torch_npu>=2.5.1.post1.dev20250619 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by DeepSeek R1 and UT passed Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-06-25 20:59:06 +08:00
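As background for #1308, the sketch below shows what combined top-k and top-p filtering computes for a single row of logits, in plain Python. It is a reference illustration under stated assumptions, not the fused NPU kernel and not vLLM's implementation: the function name `top_k_top_p_filter` is hypothetical, top-k is applied before top-p, and edge-case handling (ties, batching) may differ from the real ops.

```python
import math


def top_k_top_p_filter(logits, k=None, p=None):
    """Return a copy of logits with filtered-out entries set to -inf.

    k keeps the k largest logits; p then keeps the smallest prefix of the
    remaining candidates whose renormalized probability mass reaches p.
    """
    # Candidate indices, largest logit first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)

    if k is not None:
        order = order[:k]

    if p is not None:
        # Softmax over the surviving candidates only (max-shifted for stability).
        peak = logits[order[0]]
        weights = [math.exp(logits[i] - peak) for i in order]
        total = sum(weights)

        kept, mass = [], 0.0
        for index, weight in zip(order, weights):
            kept.append(index)
            mass += weight / total
            if mass >= p:
                break
        order = kept

    allowed = set(order)
    return [value if i in allowed else float("-inf")
            for i, value in enumerate(logits)]


print(top_k_top_p_filter([2.0, 1.0, 0.0, -1.0], k=3, p=0.8))
```

With `k=None` and `p=None` the input is returned unchanged, which mirrors the fallback idea in the commit message: the filtering path is only taken when the sampling parameters ask for it.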
wangxiyuan	b350edae9a	[UT] refactor test_expert_load_balancer and fix broken CI (#1293 ) Refactor test_expert_load_balancer to keep the ut code style. This PR also fixes the breaking change from https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280 This is just a quick fix. We'll support embedding on V1 later. Closes: https://github.com/vllm-project/vllm-ascend/issues/1299 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 01:02:52 +08:00
songshanhu07	ebb2a70dbb	static EPLB fix bug, add unit test (#1186 ) ### What this PR does / why we need it? 1.add static EPLB unit test 2.fix bug: Tensor cannot be directly judged by if statements ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Run the unit test. --------- Signed-off-by: songshanhu07 <1763685535@qq.com>	2025-06-18 19:46:56 +08:00

51 Commits