xc-llm-ascend

Author	SHA1	Message	Date
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
Li Wang	57b9f02185	[Bugfix] Fix disaggregated pd error (#2242 ) ### What this PR does / why we need it? Fix `ascend_env has no attr VLLM_ASCEND_ENABLE_CHUNK_MC2`, remove useless lines - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:48:10 +08:00
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with： - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues： - [ ] Not support V1 Scheduler (chunked prefill), will be supported in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, need to comment out the line 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
jinyuxin	583ad8f347	[main][refractor] Refractor forward metadata retrieval across DP nodes to reduce redundant padding. (#2062 ) Before refactoring cross-DP decoding metadata aggregation, clean up the token‐padding logic . ### What this PR does： 1. First checks whether any DP instance is in the prefill phase. 2. If in the `decode` phase and `torchair_graph_enabled `is true, pads each DP instance’s token count up to the global maximum. 3. If in the `prefill` phase, or in decode phase with graph mode disabled, returns each DP instance’s original token count without padding. This reordering removes the previous two‐step padding/unpadding flow and ensures padding only occurs when strictly necessary. - vLLM version: v0.10.0 - vLLM main: `bd3db7f469` Signed-off-by: yx0716 <jinyx1007@foxmail.com> Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-05 17:03:36 +08:00
Li Wang	ad366bf908	[Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181 ) ### What this PR does / why we need it? This pr fix broken CI: 1. Fix the `ee2eb6ecd8` changes, in this commit, they fused the gate and up projections in the vision MLP, This can improve performance by reducing one matrix multiplication. so, this pr do the following things: - Specify that the two linear layers are fused as `mlp.gate_up_proj` when loading the weights. - Use a SiluAndMul activation function. 2. Fix `aefeea0fde`, Update ModelRunnerOutput parameters to adapt to its changes 3. Fix [vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356), fix qwen moe ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `fed5849d3f` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-04 21:37:50 +08:00
Pleaplusone	f939381c6f	[Bugfix] Adopt the new changes on disaggregated pd from vllm main branch (#2122 ) ### What this PR does / why we need it? We notice that vllm's main branch merged the PR https://github.com/vllm-project/vllm/pull/21072 and https://github.com/vllm-project/vllm/pull/21473 to support ray backend and fix some rebase bug from previous change. Those changes makes the disaggregate pd in vllm ascend breaks in some scenario. In this PR, we adopt those changes to make sure the `llmdatddist_c_mgr_connector` works fine on the newest vllm main branch. ### Does this PR introduce _any_ user-facing change? No user face change. ### How was this patch tested? relevant ut will be added to make sure the functionality of those changes. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-04 10:08:58 +08:00
Li Wang	2284289880	[MISC] Cherry pick #1291 from v0.9.1-dev (#1825 ) ### What this PR does / why we need it? Cherry pick #1291 from v0.9.1-dev, This pr implement the synchronization of whether `dbo` is enabled across all dp ranks. specifically, it performed allreduce op across multiple DP ranks, only when all the dp rank is `enable_dbo`, it is enabled Co-authored-by: shikang-hangzhou <459956190@qq.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 09:08:45 +08:00
ApsarasX	72eceff94d	[Bugfix] `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method (#2022 ) ### What this PR does / why we need it? Fix #2033 Sync https://github.com/vllm-project/vllm/pull/14702 to solve `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by upstream vllm - vLLM version: v0.10.0 - vLLM main: `6e599eebe8` Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-31 09:03:27 +08:00
wangxiyuan	9b67c87b14	[Refactor]Refactor sampler (#2050 ) Refactor Sampler implementation from patch way to inherit from vLLM Sampler interface. Next step: Make the op `TopKTopPSampler` in vLLM support custom ops register mechanism - vLLM version: v0.10.0 - vLLM main: `61a6905ab0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-30 08:47:22 +08:00
whx	98cadc2146	[Perf] Avoid performing index selection of sin/cos cache every layer (#1890 ) Optimize number of index selections of sin/cos cache. - vLLM version: v0.10.0 - vLLM main: `656c24f1b5` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 18:06:45 +08:00
wangxiyuan	4a008c4dac	[Misc]Clean up useless import from vllm (#2049 ) Clean up useless import from vllm to make code more clear. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 16:01:59 +08:00
zzzzwwjj	ba3dfbd59e	[main][refactor] Refactoring forward_context and model_runner_v1 (#1979 ) ### What this PR does / why we need it? A refactoring of forward_context and model_runner_v1, add some context which is necessary in model inference into forward_context, and refactor dummy_run logic, make it more reasonable. Some details for this PR: Add `ascend_forward_context`; Update mc2_v2 op, and support `active_mask` param; Update scripts in examples dir; refactor `dummy_run` logic; Add soc_version for A2 and A3; ### Does this PR introduce _any_ user-facing change? No change at user-facing. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `57c22e57f9` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-07-28 14:06:20 +08:00
Pleaplusone	df0ec55162	Disaggregate prefill for kv cache register style (#950 ) ### What this PR does / why we need it? This PR adopt `LLMDataDist` for kv cache register and `pull_blocks` style disaggregate prefill implementation. The interface implementation mainly follows the design of NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 . This PR can be test with the following step: - Generate the rank table for all machine. - execute`toy_proxy.py` to launch the disaggregate prefill proxy server, specify the prefill ip, port and the decode ip, port - Run the prefill server and decode server. - send the request to the disaggregate prefill proxy ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Signed-off-by: liziyu179 <3475441767@qq.com> Signed-off-by: underfitc <hucong24@huawei.com> Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Co-authored-by: liziyu179 <3475441767@qq.com> Co-authored-by: underfitc <hucong24@huawei.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-07-26 17:15:47 +08:00
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
Li Wang	2f50304c19	[Bugfix] Add get_supported_tasks interface to fix broken CI (#2023 ) ### What this PR does / why we need it? Added `get_supported_tasks` interface to adapt to vllm [changes](`46d81d6951 (diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29)`) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `5ac3168ee3` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-26 08:20:21 +08:00
wangxiyuan	846555cdb5	[Misc] Clean up uesless code in attention (#1933 ) Before do attention module refactor, we can do some code cleanup to make the next step easier. What this PR does: 1. remove uesless `common_prefix_len` for attention builder 2. remove uesless `is_only_prefill` and `num_input_tokens` in attention metadata. 3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead, `CommonAttentionMetadata` is over designed and uesless 4. update the attention backend input parameters to keep the same as vLLM. 5. Rename attention name to the same style with `ASCEND` prefix - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-24 10:23:34 +08:00
Mengqing Cao	3aa3b46bfe	[V1][PP] Support pp with ray backend in V1 (#1800 ) ### What this PR does / why we need it? Support pipeline parallel with ray backend in V1Engine. Fixes #1751 ### Does this PR introduce _any_ user-facing change? Users could specify ray as distributed backend when inferencing with pp ### How was this patch tested? CI passed with new added test. - vLLM version: v0.9.2 - vLLM main: `32142b3c62` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-23 14:52:52 +08:00
Li Wang	33e1ea4d1a	[CI] Fix broken CI (#1915 ) ### What this PR does / why we need it? Fix [#21227](https://github.com/vllm-project/vllm/pull/21227) to make ci happy - vLLM version: v0.9.2 - vLLM main: `6b46c4b653` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-22 08:38:30 +08:00
wangxiyuan	7265dc090d	[2/4][Refactor] Refactor torchair utils (#1892 ) There is a lot torchair specified logic in common code. It results hard code maintenance. We will create a new torchair module to launch torchair related logic there. I plan to add 4 PR. 1. Refactor worker 2. Refactor utils (this PR) - simple change that move all torchair related util function to torchair module 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 19:43:30 +08:00
wangxiyuan	2b726d8f90	[CI] Fix broken CI (#1889 ) 1. vLLM commit `45badd05d0` changed the pooling check logic which broken vLLM Ascend. 2. vLLM commit `3e04107d97` requires higher version of transformers. The transformers version bug has been fixed by `e936e401de`. We can safe to remove the version limit now. 3. vLLM commit `217937221b` added a new input `enable_eplb` for FusedMoe Ops This PR fix the broken CI. - vLLM version: v0.9.2 - vLLM main: `6a971ed692` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-20 02:11:57 +08:00
Li Wang	f9dfde02fd	[Bugfix] Fix broken CI (#1848 ) ### What this PR does / why we need it? - Fix broken commit by [#20927](https://github.com/vllm-project/vllm/pull/20927) - Fix broken commit by [#20466](https://github.com/vllm-project/vllm/pull/20466) - TODO: more fully adapt to the upstream reconstruction, let's first make CI happy - vLLM version: v0.9.2 - vLLM main: `11dfdf21bf` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-17 20:10:12 +08:00
wangxiyuan	494b0f474f	[CI]Fix broken CI (#1773 ) This PR fixed the broken CI. It require https://github.com/vllm-project/vllm/pull/20900 merged first. - vLLM version: v0.9.2 - vLLM main: `e8cc53af5e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 00:54:20 +08:00
weiguihua2	aa4240c67f	Support pipeline parallel in V1 Engine (#1700 ) ### What this PR does / why we need it? This patch supports pipeline parallel in V1 Engine ### Does this PR introduce _any_ user-facing change? Yes, users can run PP in V1 ### How was this patch tested? Manully test - vLLM version: v0.9.2 - vLLM main: `31d5c1797f` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-07-11 15:30:51 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256 ) ### What this PR does / why we need it? Follow vllm-project/vllm lint way: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to avoid some low level error AMAP. This pr is one step of #1241, The purpose is make linting system more clear and convenient, on this step, Mainly did the following things: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format(check for csrc with google style) need clean code, disable for now - pymarkdown need clean code, disable for now - shellcheck need clean code, disable for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with new added/existing test. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
wangxiyuan	b979ee353d	[Misc] Code clean up (#1679 ) Make model_runner_v1 more readable - vLLM version: v0.9.2 - vLLM main: `baed180aa0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 14:33:40 +08:00
wangxiyuan	392fd7239b	[Misc] Add attention mask (#1673 ) Move attention mark from V0 to common place. - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 09:12:03 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vllm has released 0.9.2. This PR drop 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
NeverRaR	71de52d3a9	feat: add kv cache memory cache and skip dynamo guard (#1549 ) ### What this PR does / why we need it? 1、Sometimes loading torchair cache will fail because of the floating of npu memory, so this pr add a new cache to save the old kv cache bytes to avoid the possible crash while loading the torchair graph cache. 2、When caching is enabled and does not exist, the first compilation introduces the overhead of Dynamo Gurad. So in this case, we will compile them directly twice to skip them (This will bring 3-4 ms of tpot optimization) ### Does this PR introduce _any_ user-facing change? Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to control kv cache floating tolerance ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:37:14 +08:00
Vincent Yuan	eb390545ec	[Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series (#1591 ) ### What this PR does / why we need it? Since running on Altlas 300I Duo was initial supported after #1333 , this PR will disable the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers, which help improving performance. See #1563 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test manually: https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339 Signed-off-by: Vincent Yuan <farawayboat@gmail.com>	2025-07-05 16:29:21 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allows deepseek to use torchair. Now we can also use it with pangu. Besides, we add a support model list to control which type of models that can use torchair. ### How was this patch tested? We have test it with PanguProMoE on both 800IA2 and 300I Duo platforms, and model generates answer normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Angazenn	9fbd8017c0	[Quantization]300I Duo support w8a8 quantization (#1560 ) ### What this PR does / why we need it? This pr supports w8a8 on 300I Duo platform. The main change is to use `npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? offline inference on 310p runs normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:12:46 +08:00
wangxiyuan	a45dfde283	[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602 ) Make CI happy 1. `c1909e7e8c` changed moeConfig init way 2. `48fb076cbc` changed input batch logic. This PR address these change to vllm-ascend. Closes: https://github.com/vllm-project/vllm-ascend/issues/1600 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-03 18:36:17 +08:00
Zhu Yi Lin	6b80c5acba	Fix W8A8 fused moe bug (#1529 ) ### What this PR does / why we need it? 1. drop some useless code for w8a8 fusedmoe 2. Add in8 kv cache check 3. Add more ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: zhuyilin <809721801@qq.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-02 16:40:51 +08:00
wangxiyuan	641a4e6092	[CI] Cache sampled token ids in model runner to fix CI error (#1573 ) ### What this PR does / why we need it? vllm change `7f280d69c9` break vllm-ascend. This PR Fix the broken CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? passed Closes: https://github.com/vllm-project/vllm-ascend/issues/1572 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-02 12:11:14 +08:00
Pleaplusone	0e43813120	[ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546 ) ### What this PR does / why we need it? This PR (adapted from `2863befce3`) updates the CachedRequestData definition to use a single instance shared across all requests in a batch, instead of creating a new instance per request. Found ci boken by the vllm's model_runner change: `ERROR 07-01 09:53:53 [core.py:521] TypeError: 'CachedRequestData' object is not iterable`, Modify the model_runner to fix it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pass ci will verify this. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-02 06:05:21 +08:00
Shanshan Shen	8013634e9c	[Structured Output] Remove redundant check for `grammar_bitmask` (#1459 ) ### What this PR does / why we need it? Remove redundant check since we have check this at https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-30 17:39:19 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Change as little existing code as possible to add v1 pooling task's support, notice that i move down the `vllm.v1.worker.gpu_input_batch` to vllm-ascend, Considering the frequent changes in upstream interfaces, in order to decouple, so i move it here ### How was this patch tested? CI passed with new added/existing test, and I have a simple test was first conducted locally which is adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like bellow： ```python import os import torch from vllm import LLM os.environ["VLLM_USE_MODELSCOPE"]="True" def get_detailed_instruct(task_description: str, query: str) -> str: return f'Instruct: {task_description}\nQuery:{query}' # Each query must come with a one-sentence instruction that describes the task task = 'Given a web search query, retrieve relevant passages that answer the query' queries = [ get_detailed_instruct(task, 'What is the capital of China?'), get_detailed_instruct(task, 'Explain gravity') ] # No need to add instruction for retrieval documents documents = [ "The capital of China is Beijing.", "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun." ] input_texts = queries + documents model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed") outputs = model.embed(input_texts) embeddings = torch.tensor([o.outputs.embedding for o in outputs]) scores = (embeddings[:2] @ embeddings[2:].T) print(scores.tolist()) # [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]] ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
Zhu Yi Lin	b308a7a258	support pangumoe w8a8c8 and docs (#1477 ) ### What this PR does / why we need it? support pangu moe w8a8c8 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. Signed-off-by: zhuyilin <809721801@qq.com>	2025-06-28 18:51:07 +08:00
Mengqing Cao	5f4391652f	[PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483 ) ### What this PR does / why we need it? Support prompt logprobs in V1. This also enable lm_eval to test accuracy on V1 ### Does this PR introduce _any_ user-facing change? support prompt logprobs output ### How was this patch tested? CI passed with accuracy test. Using lm_eval, which use prompt logprobs as output to test accuracy, to test: ```python VLLM_USE_V1=1 lm_eval \ --model vllm \ --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \ --tasks ceval-valid_computer_network \ --batch_size 8 ``` After this pr, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct` on V1 is: ```bash \| Tasks \|Version\|Filter\|n-shot\| Metric \| \|Value \| \|Stderr\| \|----------------------------\|------:\|------\|-----:\|--------\|---\|-----:\|---\|-----:\| \|ceval-valid_computer_network\| 2\|none \| 0\|acc \|↑ \|0.7368\|± \|0.1038\| \| \| \|none \| 0\|acc_norm\|↑ \|0.7368\|± \|0.1038\| ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1043 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-28 09:38:52 +08:00
yiz-liu	2690697caa	[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416 ) ### What this PR does / why we need it? Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds asserts in the `GatherV3` operator. Currently, in [`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124), the `position` tensor may contain values that exceed the dimensions of the attention mask, triggering a `GatherV3` boundary check failure. These invalid indices originate from stale “dirty” entries left over in `position` due to padding logic in the ACL graph. Specifically, in [`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989), the variable `num_input_tokens` is always greater than or equal to `total_num_scheduled_tokens`, so any positions not explicitly cleared from a previous batch will persist and cause this sporadic error. BTW, in the original vLLM implementation, masks are constructed internally using other args, so these lingering values do not surface. However, on the Ascend platform—where split-fuse attention requires externally supplied masks—these residual indices become critical and lead to this elusive, hard-to-reproduce failure. The fix is to explicitly reset or zero out all unused entries in the `position` tensor before passing it to `GatherV3`, ensuring that every index lies within the valid range of the attention mask. Closes: https://github.com/vllm-project/vllm-ascend/issues/1038 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-26 09:27:43 +08:00
wangxiyuan	ca884ef86d	[Misc] Clean up uesless code for LLM initialize (#1373 ) This PR aims to clean up the useless code for LLM setup. It helps to make the code more clear. 1. remove useless `self.xxx` property 2. change `set_random_seed` to `seed_everything` 3. remove `set_custom_all_reduce`, it's only used for cuda This is just a code clean. no change for any code logic. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-25 16:20:14 +08:00
Li Wang	5f5800ba42	[Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399 ) ### What this PR does / why we need it? Sync MRotaryEmbedding interface change to recover main CI (https://github.com/vllm-project/vllm/pull/19939) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-24 22:56:39 +08:00
Yikun Jiang	097e7149f7	[Platform] Add initial experimental support for Altlas 300I series (#1333 ) ### What this PR does / why we need it? Add initial experimental support for Ascend 310P, this patch squash below PR into one to help validation: - https://github.com/vllm-project/vllm-ascend/pull/914 - https://github.com/vllm-project/vllm-ascend/pull/1318 - https://github.com/vllm-project/vllm-ascend/pull/1327 ### Does this PR introduce _any_ user-facing change? User can run vLLM on Altlas 300I DUO series ### How was this patch tested? CI passed with: - E2E image build for 310P - CI test on A2 with e2e test and longterm test - Unit test missing because need a real 310P image to have the test, will add in a separate PR later. - Manually e2e test: - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322 - Pangu MGoE 72B The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended. #### ENV information CANN, NNAL version: 8.1.RC1 > [!IMPORTANT] > PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ format and calling NNAL operators on 310P #### Code example ##### Build vllm-ascend from source code ```shell # download source code as vllm-ascend cd vllm-ascend export SOC_VERSION=Ascend310P3 pip install -v -e . cd .. ``` ##### Run offline inference ```python from vllm import LLM, SamplingParams prompts = ["水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。", "水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。"] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10) # Create an LLM. llm = LLM( model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096, max_num_seqs=4, dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P disable_custom_all_reduce=True, trust_remote_code=True, tensor_parallel_size=2, compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]}, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` --------- Signed-off-by: Vincent Yuan <farawayboat@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: Vincent Yuan <farawayboat@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-21 09:00:16 +08:00
yuancaoyaoHW	00ae250f3c	[V1][eagle3] Support eagle3 proposer for v1 (#1032 ) ### What this PR does / why we need it? This PR implements the Eagle Pososer feature for vLLM v1, which enables more efficient speculative decoding by using a draft model to predict potential future tokens. - The implementation includes the core Eagle algorithm integration with vLLM's existing architecture, allowing for faster inference while maintaining output quality. - This is needed to significantly improve the generation speed of large language models without compromising on the quality of generated text. ### Does this PR introduce any user-facing change? Yes, this PR introduces a new speculative decoding mode that can be enabled via configuration. - Users can now choose to use Eagle Pososer by setting appropriate flags in the inference configuration. - The API remains backward compatible, with the new functionality being opt-in. ### How was this patch tested? CI passed with new unit tests added for the Eagle Pososer functionality. - Benchmark tests were conducted comparing generation speed and quality with and without Eagle Pososer. - Integration tests were performed with various model architectures to ensure compatibility. - Manual testing was done using different prompt scenarios to verify output quality remains consistent. - we test accept rate on one Ascend 910B npu, The acceptance rate results are basically consistent with those shown here: https://github.com/vllm-project/vllm/pull/16937 - Currently, we support scenarios where num_spec_tokens <= 2. When num_spec_tokens > 2, issues such as insufficient GPU memory and operator computation errors may occur. We will address this in subsequent updates. - We will add support for Eagle v1 in future updates. ### Acceptance Test Script ```bash SCRIPT="/offline/eagle.py" DATASET="ShareGpt" MODEL=Meta-Llama-3.1-8B-Instruct DRAFT=EAGLE3-LLaMA3.1-Instruct-8B CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \ --dataset $DATASET \ --num_spec_tokens 2 \ --max_num_seqs 1 \ --model_dir $MODEL \ --eagle_dir $DRAFT \ --tp 1 \ --num_prompts 80 ``` ### Acceptance Test Results ```bash ██████████████████████████████████████████████████████████████████████████████████████████████████████████\| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s] ------------------------------------------------------------------------------------- mean acceptance length: 1.63 ------------------------------------------------------------------------------------- total_counts: 8062 acceptance at token 0: 1.00 (8062 times) acceptance at token 1: 0.70 (5612 times) acceptance at token 2: 0.47 (3765 times) ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1004 --------- Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>	2025-06-20 17:19:54 +08:00
yiz-liu	2c7dd85fd8	[Fix] Fix the token-wise padding mechanism (#1300 ) ### What this PR does / why we need it? Fix the token-wise padding mechanism. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-20 14:46:17 +08:00
wangxiyuan	b350edae9a	[UT] refactor test_expert_load_balancer and fix broken CI (#1293 ) refactor test_expert_load_balancer to keep the ut code style This PR also fixed the break change from https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280 This is just a quick fix. We'll support embedding on V1 later Closes: https://github.com/vllm-project/vllm-ascend/issues/1299 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 01:02:52 +08:00
zzzzwwjj	23ca68d0c8	[refactor] Refactoring AscendFusedMoE (#1229 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? This PR is used for resolved [issue 1147](https://github.com/vllm-project/vllm-ascend/issues/1147) 1. Move fused_moe code into one file `fused_moe.py`. 2. Integrate branch conditions into function `get_fused_moe_state`. <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? 1. This PR has removed the env `VLLM_ENABLE_MC2`, because I think this env is useless, we can make judgments based on the current scenario without this env, it will only increase complexity. 2. This PR has removed the env `USING_LCCL_COM`, because this env has already expired. 3. `additional_config.expert_tensor_parallel_size` has already expired, and now we also use parameter `enable_expert_parallel`, consistent with the vLLM. <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-17 17:49:03 +08:00
fems14	ab5d110fcc	vllm-ascend support chunked prefill (#1172 ) ### What this PR does / why we need it? vllm-ascend support chunked prefill for MLA --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-06-14 22:31:16 +08:00
wangxiyuan	4f5964420e	[CI] Upgrade vllm to 0.9.1 (#1165 ) 1. upgrade vllm to 0.9.1. 0.9.0 is not supported for main branch now. keep doc to 0.9.0 until we release the first 0.9.1 release. 2. disable V0 test for PR 3. move actionlint check to lint job Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-11 16:33:11 +08:00
depeng1994	860a5ef7fd	provide an e2e guide for execute duration profiling (#1113 ) ### What this PR does / why we need it? provide an e2e guide for execute duration profiling Signed-off-by: depeng1994 <depengzhang@foxmail.com>	2025-06-11 10:02:11 +08:00

1 2

94 Commits