xc-llm-ascend

Author	SHA1	Message	Date
YuanCheng-coder	ddaded1537	Add ut for envs.py (#2131 ) What this PR does / why we need it? Test vllm_ascend/envs.py, which contains the environment variable definitions. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with newly added test. vLLM version: v0.10.0 vLLM main: `9532a6d563` - vLLM version: v0.10.0 - vLLM main: `b4e081cb15` --------- Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-08-02 16:53:44 +08:00
xleoken	bea3d5bbb4	[Bug] Fix run bug in run_dp_server.sh (#2139 ) ### What this PR does / why we need it? For `Qwen2.5-0.5B-Instruct` model - the model's total number of attention heads (14) must be divisible by tensor parallel size. (4 -> 2) - the model does not support enable-expert-parallel ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: xleoken <xleoken@163.com>	2025-08-02 16:52:12 +08:00
yangqinghao-cmss	47f688a2f0	Change retrieving remote files to local retrieval. (#2141 ) ### What this PR does / why we need it? Using vllm's AudioAsset class to retrieve remote audio files(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not feasible in some cases; it is recommended to switch to local retrieval. ### How was this patch tested? vllm:main vllm:ascend:main results: ```bash Adding requests: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:04<00:00, 4.62s/it] Processed prompts: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:03<00:00, 3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s] generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'. ``` - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-02 16:51:22 +08:00
zhangxinyuehfad	e48f32ec59	[CI] Update image for 310p ci (#2155 ) ### What this PR does / why we need it? update the latest image for 310p ci test - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-02 16:46:02 +08:00
leo-pony	e467fe1b77	Add qwen-vl model and sampling feature UT for 310I series (#2168 ) ### What this PR does / why we need it? Add qwen-vl model and sampling feature UT for 310I series - vLLM version: v0.10.0 - vLLM main: `e0f63e4a35` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 11:26:12 +08:00
weijinqian0	6e00aed4d5	[main][Feature]Moe alltoallv communication optimization for unquantized RL training scenario (#2088 ) It comes from 0.9.1dev [0.9.1][Feature]Moe alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo (#1547) - vLLM version: v0.10.0 - vLLM main: `97608dc276` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com> Signed-off-by: taoxudonghaha <justsheldon@163.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: TaoYu Chen <ctynb@qq.com> Co-authored-by: taoxudonghaha <justsheldon@163.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-08-02 09:49:10 +08:00
leo-pony	f0c1f0c828	[Doc] Add qwen vl example in tutorials for 310I series (#2160 ) ### What this PR does / why we need it? Add qwen vl example in tutorials for 310I series. Model: Qwen2.5-VL-3B-Instruct Accuracy test result, dataset MMM-val: \| \| 910B3 \| 310P3 \| \| --- \| --- \| --- \| \|Summary\|0.455 \| 0.46 \| \|--art_and_design\| 0.558 \| 0.566 \| \|--business\| 0.373 \| 0.366 \| \|--health_and_medicine\|0.513 \| 0.52 \| \|--science\|0.333 \| 0.333 \| \|--tech_and_engineering\|0.362 \| 0.380 \| \|--humanities_and_social_science\|0.691 \| 0.691 \| Function test result: 1. On line: ![image](https://github.com/user-attachments/assets/d81bba61-df28-4676-a246-c5d094815ac7) ![image](https://github.com/user-attachments/assets/0be81628-9999-4ef2-93c1-898b3043e09e) 2. Offline: ![image](https://github.com/user-attachments/assets/603275c1-6ed6-4cfc-a6e2-7726156de087) - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 08:58:56 +08:00
22dimensions	8cf97d8310	[Misc] Add extra checking to torchair_graph_config. (#1939 ) ### What this PR does / why we need it? cherry-pick #1675 to main This PR adds validation checking to torchair_graph_config for better reliability. Co-authored-by: whx-sjtu <2952154980@qq.com> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:24:11 +08:00
Li Wang	2284289880	[MISC] Cherry pick #1291 from v0.9.1-dev (#1825 ) ### What this PR does / why we need it? Cherry pick #1291 from v0.9.1-dev. This PR synchronizes whether `dbo` is enabled across all DP ranks. Specifically, it performs an allreduce op across the DP ranks, and `dbo` is enabled only when every DP rank has `enable_dbo` set. Co-authored-by: shikang-hangzhou <459956190@qq.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 09:08:45 +08:00
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
yangqinghao-cmss	99fa0ac882	[BugFix] update the kv transfer config (#2121 ) ### What this PR does / why we need it? The functions KVTransferConfig.from_cli and AscendHcclConnector are missing in the latest vLLM version. To resolve this, I propose modifying the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR #2079](https://github.com/vllm-project/vllm-ascend/pull/2079) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? vllm:main vllm-ascend:mian results: ```bash Adding requests: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 374.27it/s] Processed prompts: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s] Prefill node is finished. INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76 INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate'] Waiting for prefill node to finish... 
Adding requests: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 709.70it/s] Processed prompts: 100%\|██████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s] Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can" Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am' Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read' Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal" Cleanup prefill resources All process done ``` - vLLM version: v0.10.0 - vLLM main: `9cb497bfa3` Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-01 08:56:55 +08:00
Li Wang	968e6791d3	[Misc] Add data preprocess functions to qwen2.5_vl_without_padding (#2148 ) ### What this PR does / why we need it? Cherry pick #1705 from v0.9.1-dev. Compared with qwen2_5_vl.py, qwen2_5_vl_without_padding.py is missing some functions. The purpose of this PR is to supplement these. add: - rot_pos_emb(self, grid_thw: torch.Tensor) - get_window_index(self, grid_thw) - _process_image_input(self, image_input) - _process_video_input(self, video_input) Co-authored-by: zheliuyu <15750543867@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 08:54:02 +08:00
Li Wang	e3b3ffb875	[Misc] Disable quantization in mindie_turbo (#2147 ) ### What this PR does / why we need it? Cherry pick #1749 from v0.9.1-dev. Since the interface in vllm-ascend has changed so quickly, the quantization function in mindie_turbo is no longer needed and is discarded. Co-authored-by: zouyida <zouyida@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 08:53:00 +08:00
leo-pony	c62f346f5d	Fixed 310p failure when using the sampler feature (#2151 ) ### What this PR does / why we need it? Fixed 310p failure when using the sampler feature. The root cause is: torch_npu.npu_top_k_top_p uses the operator aclnnApplyTopKTopP, but aclnnApplyTopKTopP currently does not support 310P. First PR that has the issue is #1308. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-01 08:43:08 +08:00
Icey	86bdde1ca8	Enable pytest and yaml style accuracy test (#2073 ) ### What this PR does / why we need it? This PR enabled pytest and yaml style accuracy test, users now can enable accuracy test by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \ --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1970 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-31 21:39:13 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00
Ronald1995	e8660d7978	ut:add ut for qwen2_5_vl (#2143 ) ### What this PR does / why we need it? add ut for qwen2_5_vl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-31 20:46:17 +08:00
Ronald1995	cb0a303080	ut:add e2e test for external launcher (#2091 ) ### What this PR does / why we need it? This PR adds an e2e test case to make sure that initializing LLM via the external_launcher method works. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-31 20:37:42 +08:00
Mengqing Cao	4c8842da65	[BugFix] Fix a bug of running chunked-prefill with torchair. (#1378 ) (#1844 ) This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced before assignment` when running chunked-prefill with torchair. We should calculate `decode_hs_or_q_c` whether or not torchair graph mode is enabled. backport of #1378 fix https://github.com/vllm-project/vllm-ascend/issues/1369 - vLLM version: v0.10.0 - vLLM main: `0e36abf993` --------- Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: whx-sjtu <2952154980@qq.com>	2025-07-31 20:08:45 +08:00
daniel	db310c6ec9	add ut for device allocator/camem and multistream/layers (#2037 ) What this PR does / why we need it? Test device allocator/camem and multistream/layers, which contain resource allocation and stream ops. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with newly added test. - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2025-07-31 19:17:27 +08:00
zhanghw0354	2008152c48	[main][bugfix]Fix vLLM startup failure when inferring DeepSeek R1 model in DP scenario (#2020 ) ### What this PR does / why we need it? Fix vLLM startup failure when inferring DeepSeek R1 model in DP scenario. When running vLLM inference for the DeepSeek R1 model in DP32+TP1 configuration, the vLLM service fails to start with the following error. <img width="1786" height="918" alt="21b2011042d4f77f36f5243fa64d9c18" src="https://github.com/user-attachments/assets/df1963fe-587e-43ca-822e-a9094d0034fb" /> The root cause is a missing else branch after [this line of code](`d629f0b2b5/vllm_ascend/ops/fused_moe.py (L1411)`). This PR fixes the issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `5bbaf492a6` --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-07-31 15:30:28 +08:00
CaranLic	7c90ba5fe8	[Test] add ut for decorator.py/deepseek_mtp.py (#2127 ) ### What this PR does / why we need it? add ut for decorator.py/deepseek_mtp.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new tests - vLLM version: v0.10.0 - vLLM main: `055bd3978e` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-07-31 15:21:15 +08:00
Joey Gao	6192bc95c0	[Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051 ) bugfix cherry-pick from v0.9.1-dev https://github.com/vllm-project/vllm-ascend/pull/2007 ### What this PR does / why we need it? Minimal reproducing code: ```python # test.py from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```bash export USE_OPTIMIZED_MODEL=0 python test.py ``` The exception is as follows: ``` [rank0]: File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward [rank0]: q = torch_npu.npu_rotary_mul(q, cos, sin) [rank0]: File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__ [rank0]: return self._op(*args, **(kwargs or {})) [rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul) ``` In `AscendQwen2_5_VisionAttention_Without_Padding`, the call `torch_npu.npu_rotary_mul(q, cos, sin)` receives `cos`/`sin` on the CPU while `q` is on the NPU, which raises the error. `qwen2_5_vl_without_padding.py` needs this bugfix because `AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in qwen2_5_vl_without_padding.py comes from vllm, where `inv_freq` is created on the CPU. `40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)` ```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim)) ``` `qwen2_5_vl.py` does not need it, because `AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py rewrites `AscendQwen2_5_VisionRotaryEmbedding` so that `inv_freq` is created on the device. 
```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim)) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with newly added/existing tests. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: pjgao <gaopengju3@huawei.com> Co-authored-by: pjgao <gaopengju3@huawei.com>	2025-07-31 15:18:54 +08:00
ApsarasX	72eceff94d	[Bugfix] `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method (#2022 ) ### What this PR does / why we need it? Fix #2033 Sync https://github.com/vllm-project/vllm/pull/14702 to solve `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by upstream vllm - vLLM version: v0.10.0 - vLLM main: `6e599eebe8` Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-31 09:03:27 +08:00
Mengqing Cao	75e28d0356	[Build][Ray] Fix protobuf version in Dockerfile (#2028 ) ### What this PR does / why we need it? Fix protobuf version in Dockerfile to resolve `AttributeError: 'str' object has no attribute 'DESCRIPTOR' when packaging message to dict` using protobuf. will remove version specification after https://github.com/ray-project/ray/pull/54910 is merged ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `0e36abf993` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-30 22:49:20 +08:00
Ronald1995	3386e09a40	ut:add ut for qwen2_vl.py (#2096 ) ### What this PR does / why we need it? add ut for qwen2_vl.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `555e7225bc` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-30 22:31:47 +08:00
Mengqing Cao	936df1cb9b	[Doc] Fix cann related urls (#2106 ) ### What this PR does / why we need it? Fix CANN-related URLs in the installation doc. ### Does this PR introduce _any_ user-facing change? Users who install CANN manually can use the correct URLs after this PR. ### How was this patch tested? N/A - vLLM version: v0.10.0 - vLLM main: `5bbaf492a6` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-30 22:31:30 +08:00
Ruri	4fcca137a7	[main][Feature] Support Qwen3 W4A8 quantization (#2060 ) ### What this PR does / why we need it? Adding `W4A8_DYNAMIC` quantization support for linear. Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC` to test qwen3 w4a8_dynamic quantized model Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim` of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409` 1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim` ```shell git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409 bash install.sh ``` 2. Serve model using `vllm` ```shell VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \ --model vllm-ascend/Qwen3-8B-W4A8 \ --port 8000 \ --quantization ascend \ --tensor_parallel_size 2 \ --enforce-eager ``` - vLLM version: v0.10.0 - vLLM main: `4cd7fe6cea` --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>	2025-07-30 14:57:14 +08:00
zhangxinyuehfad	6874d666fa	[CI]Add e2e test for 310p (#1879 ) ### What this PR does / why we need it? Add e2e test for 310p: trigger conditions：tag, labels(ready-for-test, e2e-310p-test), schedule image: m.daocloud.io/quay.io/ascend/cann:8.1.rc1-310p-ubuntu22.04-py3.10 runner: linux-aarch64-310p-1, linux-aarch64-310p-4 model: IntervitensInc/pangu-pro-moe-model, Qwen/Qwen3-0.6B-Base, Qwen/Qwen2.5-7B-Instruct - vLLM version: v0.10.0 - vLLM main: `b917da442b` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-30 14:52:16 +08:00
YuanCheng-coder	34dd24adf2	add ut for vocab_parallel_embedding (#2067 ) ### What this PR does / why we need it? Test vllm_ascend/ops/vocab_parallel_embedding.py, which contains the vocab parallel embedding forward. CI passed with newly added test. vLLM version: v0.10.0 vLLM main: `2cc571199b` - vLLM version: v0.10.0 - vLLM main: `05cbbe20c5` Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-07-30 14:35:45 +08:00
Yikun Jiang	d9f82ebfce	[misc] Add reminder comment when PR submitted (#2092 ) ### What this PR does / why we need it? Add reminder comment when PR submitted ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test locally: https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126 This PR will take effect after this PR merged. - vLLM version: v0.10.0 - vLLM main: `0e36abf993` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-30 10:14:33 +08:00
hongfugui	1dbb888275	[Bugfix] LoRA logits einsum dimension mismatch in add_lora_logits (#1583 ) ### What this PR does / why we need it? This PR fixes a tensor shape mismatch in `add_lora_logits`. Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim, rank]`, which does not match the expected einsum pattern `"bi, boi -> bo"` used in `bgmv_shrink`. This causes runtime errors like: RuntimeError: einsum(): subscript i has size 3 for operand 1 which does not broadcast with previously seen size 4 ![image](https://github.com/user-attachments/assets/63029479-49ae-4c3c-b995-f6805d15ad06) This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the expected shapes: - `lora_a`: `[num_loras, rank, in_dim]` - `lora_b`: `[num_loras, out_dim, rank]` All unit tests pass after this fix. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? ``` import torch import pytest from unittest.mock import patch, PropertyMock, ANY from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU @pytest.fixture def wrapper_cpu(): cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"} w = PunicaWrapperNPU(**cfg) w.is_prefill = True w.no_lora = False return w def test_add_lora_logits(wrapper_cpu): batch_size = 2 hidden_size = 4 lora_rank = 3 vocab_size = 5 y = torch.zeros(batch_size, vocab_size) x = torch.randn(batch_size, hidden_size) num_loras = 1 lora_a = torch.randn(num_loras, hidden_size, lora_rank) lora_b = torch.randn(num_loras, lora_rank, vocab_size) with patch.object(wrapper_cpu.__class__, "sampler_indices", new_callable=PropertyMock) as mock_idx: mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long) wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0) assert y.shape == (batch_size, vocab_size) assert not torch.allclose(y, torch.zeros_like(y)) ``` Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>	2025-07-30 09:50:36 +08:00
Mengqing Cao	d80b0cca5d	[CI] Fix test on pyhccl to 2 cards (#2094 ) ### What this PR does / why we need it? Fix test on pyhccl to 2 cards ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `0d0cc9e150` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-30 09:08:00 +08:00
wangxiyuan	9b67c87b14	[Refactor]Refactor sampler (#2050 ) Refactor Sampler implementation from patch way to inherit from vLLM Sampler interface. Next step: Make the op `TopKTopPSampler` in vLLM support custom ops register mechanism - vLLM version: v0.10.0 - vLLM main: `61a6905ab0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-30 08:47:22 +08:00
whx	b6a7f07c70	[Perf][MoE] Improve MoE multistream parallel performance. (#1891 ) This PR designs the shared expert multi-stream parallelism of w8a8-dynamic-quantized MoE stage in more detail to achieve better performance. - vLLM version: v0.10.0 - vLLM main: `2cc571199b` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 23:53:19 +08:00
leo-pony	4df8e0027c	[e2e]Fixed the issue that pyhccl e2e cannot run continuously with other tests (#1246 ) ### What this PR does / why we need it? 1. Fixed the issue that pyhccl e2e cannot run continuously with other tests. 2. Cleaned up the resources occupied by the dynamic_npugraph_batchsize e2e test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is an e2e test change; the e2e multi-card tests ran successfully locally. - vLLM version: v0.9.2 - vLLM main: `0df4d9b06b` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-07-29 19:38:30 +08:00
Shanshan Shen	61fc35184b	[Doc] Add performance tuning doc to main (#1392 ) ### What this PR does / why we need it? Add performance tuning doc to main. Closes: https://github.com/vllm-project/vllm-ascend/issues/1387 - vLLM version: v0.9.1 - vLLM main: `923147b5e8` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-07-29 19:36:34 +08:00
taoxudonghaha	540336edc9	Add Custom Kernels For LoRA Performance (#1884 ) ### What this PR does / why we need it? Add two custom kernels (bgmv_shrink and bgmv_expand) to improve LoRA performance. ### Does this PR introduce _any_ user-facing change? no user-facing change ### How was this patch tested? We add unit test files to test the custom AscendC kernels. See vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py Based on the actual test of the Qwen2.5 7B model using vllm-ascend version v0.9.2.rc1, the TTFT, TPOT and throughput have improved by about 70%. - vLLM version: v0.9.2 - vLLM main: `40d86ee412` --------- Signed-off-by: taoxudonghaha <justsheldon@163.com>	2025-07-29 19:27:50 +08:00
TaoYu Chen	2da281ec5a	bump default python version to 3.11 (#2072 ) ### What this PR does / why we need it? Bump default python version to 3.11, see #1980 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI - vLLM version: v0.10.0 - vLLM main: `12a223ef9b` Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>	2025-07-29 19:07:17 +08:00
Li Wang	f60bb474f9	[CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065 ) ### What this PR does / why we need it? Currently our workflow run takes about 3 hours in total, which seriously affects the developer experience, so an optimization is urgently needed. After this PR, the running time of the full CI is expected to drop to 1h40min. - Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB) - Change TP4 ---> TP2 * 2 max-parallel - Move DeepSeek-V2-Lite-W8A8 to single card test ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `a2480251ec` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-29 18:59:05 +08:00
curryliu	ca8007f584	[Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994 ) Support the inference of the Deepseekr1-w8a8-mtp model with statically-quantized shared_head in MTP layers. - vLLM version: v0.9.2 - vLLM main: `6eca337ce0` Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>	2025-07-29 18:51:57 +08:00
whx	98cadc2146	[Perf] Avoid performing index selection of sin/cos cache every layer (#1890 ) Optimize number of index selections of sin/cos cache. - vLLM version: v0.10.0 - vLLM main: `656c24f1b5` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 18:06:45 +08:00
wangxiyuan	0190b68f51	[Misc]Remove PD v0 code (#2047 ) Cleanup V0 disaggregated prefill code for V0 Engine. part of https://github.com/vllm-project/vllm-ascend/issues/1620 TODO: enable v1 e2e test. - vLLM version: v0.10.0 - vLLM main: `2cc571199b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 19:09:22 +08:00
Yikun Jiang	935e9d4c9d	Pin transformers to fix v0.9.1 doctest (#2048 ) ### What this PR does / why we need it? Pin transformers to fix v0.9.1 doctest ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest passed - vLLM version: v0.10.0 - vLLM main: `c657369841` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-28 17:51:56 +08:00
huangxialu	1a25b0a2dd	[Test] add ut for qwen3_moe.py (#2055 ) ### What this PR does / why we need it? Add ut for qwen3_moe.py ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-28 17:37:13 +08:00
whx	e7d32ed3f1	[BugFix] Fix the problem that torchair doesn't support tp > 4. (#1508 ) This PR removes the restriction that TP cannot be greater than 4 in torchair scenario, because current newest version of CANN has fixed this bug. - vLLM version: v0.10.0 - vLLM main: `04ff4be310` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-28 16:48:05 +08:00
wangxiyuan	4a008c4dac	[Misc]Clean up useless import from vllm (#2049 ) Clean up useless import from vllm to make code more clear. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 16:01:59 +08:00
wangxiyuan	34cfdf5520	[Misc] Fix logger bug (#2024 ) 1. Remove useless logger 2. Fix logger bug, same problem as https://github.com/vllm-project/vllm-ascend/pull/515 - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 15:59:09 +08:00
LeeWenquan	3ad582c9a9	[Test] Add ut for files in /attention (#1944 ) ### What this PR does / why we need it? Add ut for files in folder /attention ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `139a7f07bd` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-28 15:54:40 +08:00
Ronald1995	32a9c5f694	[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926 ) ### What this PR does / why we need it? vLLM's RowParallelLinear forward function executes allreduce and matmul separately. This feature uses torch_npu.npu_mm_all_reduce_base to execute allreduce and matmul as a fused kernel, which gives a 20% performance improvement in eager mode. ### Does this PR introduce _any_ user-facing change? This PR introduces a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to control whether the feature is enabled. ### How was this patch tested? The patch is tested by a new test file, `test_patch_linear.py`, added to the unit tests. - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-28 15:13:37 +08:00
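
Note on the run_dp_server.sh fix (#2139) listed above: it rests on a simple constraint, namely that the model's number of attention heads must be divisible by the tensor parallel size, which is why 14 heads fail with a size of 4 and work with 2. Below is a minimal sketch of that check in plain Python. The function name `valid_tp_sizes` is hypothetical and is not part of vLLM or vllm-ascend; it only illustrates the divisibility rule.

```python
def valid_tp_sizes(num_attention_heads: int, max_tp_size: int) -> list[int]:
    """Return every tensor parallel size in 1..max_tp_size that divides the head count.

    Hypothetical helper for illustration only; it is not a vLLM API.
    """
    return [
        tp_size
        for tp_size in range(1, max_tp_size + 1)
        if num_attention_heads % tp_size == 0
    ]


if __name__ == "__main__":
    # 14 heads is the figure the commit message gives for Qwen2.5-0.5B-Instruct.
    print(valid_tp_sizes(14, 8))
```

For 14 heads the list contains 2 but not 4, which matches the change from 4 to 2 described in that commit.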


640 Commits