xc-llm-ascend

Author	SHA1	Message	Date
Pleaplusone	7e6efbf2a9	update torch-npu to 2.5.1.post1.dev20250619 (#1347 ) ### What this PR does / why we need it? This PR update the torch_npu to newest release version 2.5.1.post1.dev20250619 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI tested will guarantee the update Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-06-23 09:02:09 +08:00
xleoken	4447e53d7a	[Doc] Change not to no in faqs.md (#1357 ) ### What this PR does / why we need it? Change not to no in faqs.md. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test Signed-off-by: xleoken <xleoken@163.com>	2025-06-23 09:01:00 +08:00
Yikun Jiang	a95afc011e	[CI] Enable merge trigger unit test and accuracy test schedule job (#1345 ) ### What this PR does / why we need it? - Enable merge trigger unit test and accuracy test schedule job - Pin lm-eval==0.4.8 to resovle Qwen3 8B accuracy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-22 17:21:57 +08:00
Yikun Jiang	2e5f312530	Cleanup ununsed doc (#1352 ) ### What this PR does / why we need it? Cleanup ununsed doc for MoGE model, we will add back this when MoGE model ready. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-22 15:05:30 +08:00
Yikun Jiang	c30ddb8331	Bump v0.9.1rc1 release (#1349 ) ### What this PR does / why we need it? Bump v0.9.1rc1 release Closes: https://github.com/vllm-project/vllm-ascend/pull/1341 Closes: https://github.com/vllm-project/vllm-ascend/pull/1334 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-22 13:15:36 +08:00
Yikun Jiang	097e7149f7	[Platform] Add initial experimental support for Altlas 300I series (#1333 ) ### What this PR does / why we need it? Add initial experimental support for Ascend 310P, this patch squash below PR into one to help validation: - https://github.com/vllm-project/vllm-ascend/pull/914 - https://github.com/vllm-project/vllm-ascend/pull/1318 - https://github.com/vllm-project/vllm-ascend/pull/1327 ### Does this PR introduce _any_ user-facing change? User can run vLLM on Altlas 300I DUO series ### How was this patch tested? CI passed with: - E2E image build for 310P - CI test on A2 with e2e test and longterm test - Unit test missing because need a real 310P image to have the test, will add in a separate PR later. - Manually e2e test: - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322 - Pangu MGoE 72B The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended. #### ENV information CANN, NNAL version: 8.1.RC1 > [!IMPORTANT] > PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ format and calling NNAL operators on 310P #### Code example ##### Build vllm-ascend from source code ```shell # download source code as vllm-ascend cd vllm-ascend export SOC_VERSION=Ascend310P3 pip install -v -e . cd .. ``` ##### Run offline inference ```python from vllm import LLM, SamplingParams prompts = ["水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。", "水的沸点是100摄氏度吗？请回答是或者否。", "若腋下体温为38摄氏度，请问这人是否发烧？请回答是或者否。"] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10) # Create an LLM. llm = LLM( model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096, max_num_seqs=4, dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P disable_custom_all_reduce=True, trust_remote_code=True, tensor_parallel_size=2, compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]}, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` --------- Signed-off-by: Vincent Yuan <farawayboat@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: Vincent Yuan <farawayboat@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: shen-shanshan <467638484@qq.com>	2025-06-21 09:00:16 +08:00
Yikun Jiang	2009fdb8da	[Test] Enable code cov for V1 and enable push trigger (#1164 ) ### What this PR does / why we need it? - Enable code cov for V1 - Enable push triggered job ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-21 00:01:05 +08:00
Angazenn	2f1266d451	Support Pangu Pro MoE model (#1204 ) ### What this PR does / why we need it? Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411) ### Does this PR introduce _any_ user-facing change? Yes, new model supported ### How was this patch tested? Test locally --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-20 23:59:59 +08:00
yuancaoyaoHW	00ae250f3c	[V1][eagle3] Support eagle3 proposer for v1 (#1032 ) ### What this PR does / why we need it? This PR implements the Eagle Pososer feature for vLLM v1, which enables more efficient speculative decoding by using a draft model to predict potential future tokens. - The implementation includes the core Eagle algorithm integration with vLLM's existing architecture, allowing for faster inference while maintaining output quality. - This is needed to significantly improve the generation speed of large language models without compromising on the quality of generated text. ### Does this PR introduce any user-facing change? Yes, this PR introduces a new speculative decoding mode that can be enabled via configuration. - Users can now choose to use Eagle Pososer by setting appropriate flags in the inference configuration. - The API remains backward compatible, with the new functionality being opt-in. ### How was this patch tested? CI passed with new unit tests added for the Eagle Pososer functionality. - Benchmark tests were conducted comparing generation speed and quality with and without Eagle Pososer. - Integration tests were performed with various model architectures to ensure compatibility. - Manual testing was done using different prompt scenarios to verify output quality remains consistent. - we test accept rate on one Ascend 910B npu, The acceptance rate results are basically consistent with those shown here: https://github.com/vllm-project/vllm/pull/16937 - Currently, we support scenarios where num_spec_tokens <= 2. When num_spec_tokens > 2, issues such as insufficient GPU memory and operator computation errors may occur. We will address this in subsequent updates. - We will add support for Eagle v1 in future updates. ### Acceptance Test Script ```bash SCRIPT="/offline/eagle.py" DATASET="ShareGpt" MODEL=Meta-Llama-3.1-8B-Instruct DRAFT=EAGLE3-LLaMA3.1-Instruct-8B CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \ --dataset $DATASET \ --num_spec_tokens 2 \ --max_num_seqs 1 \ --model_dir $MODEL \ --eagle_dir $DRAFT \ --tp 1 \ --num_prompts 80 ``` ### Acceptance Test Results ```bash ██████████████████████████████████████████████████████████████████████████████████████████████████████████\| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s] ------------------------------------------------------------------------------------- mean acceptance length: 1.63 ------------------------------------------------------------------------------------- total_counts: 8062 acceptance at token 0: 1.00 (8062 times) acceptance at token 1: 0.70 (5612 times) acceptance at token 2: 0.47 (3765 times) ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1004 --------- Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>	2025-06-20 17:19:54 +08:00
wangxiyuan	45be1aac0c	[CI] Add codespell check for doc (#1314 ) Add codespell check test for doc only PR Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 16:48:14 +08:00
22dimensions	761bd3d9d7	Add user guide for quantization (#1206 ) ### What this PR does / why we need it? Add user guide for quantization ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-20 15:53:25 +08:00
yiz-liu	2c7dd85fd8	[Fix] Fix the token-wise padding mechanism (#1300 ) ### What this PR does / why we need it? Fix the token-wise padding mechanism. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-20 14:46:17 +08:00
wangxiyuan	b350edae9a	[UT] refactor test_expert_load_balancer and fix broken CI (#1293 ) refactor test_expert_load_balancer to keep the ut code style This PR also fixed the break change from https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280 This is just a quick fix. We'll support embedding on V1 later Closes: https://github.com/vllm-project/vllm-ascend/issues/1299 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-20 01:02:52 +08:00
songshanhu07	ebb2a70dbb	static EPLB fix bug, add unit test (#1186 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> 1.add static EPLB unit test 2.fix bug: Tensor cannot be directly judged by if statements ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Run the unit test. --------- Signed-off-by: songshanhu07 <1763685535@qq.com>	2025-06-18 19:46:56 +08:00
Shanshan Shen	2cd8ecdc4f	[Bugfix][Spec Decode] Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode (#1258 ) ### What this PR does / why we need it? Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode. Find more details at mengwei805's comment in https://github.com/vllm-project/vllm-ascend/pull/1123. ### Does this PR introduce _any_ user-facing change? The user will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE` (`ACL_OP_INIT_MODE`). ### How was this patch tested? Test scripts: ```python from vllm import LLM, SamplingParams prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="Qwen/Qwen2.5-1.5B-Instruct", tensor_parallel_size=1, speculative_config={ "method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4, }, ) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Results: ``` Adding requests: 100%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 76.70it/s] Processed prompts: 100%\|███████████████████████████████████████████████████████████████\| 1/1 [00:00<00:00, 1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s] Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James' ``` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-18 17:50:20 +08:00
zzzzwwjj	db2f630aeb	[bugfix] fix deepseek with mc2 (#1268 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-18 00:58:38 +08:00
whx	d7e19ed57a	[BugFix] fix length of sin/cos cache in rope (#1266 ) This PR fixes the bug that constructs shorter sin/cos cache than model's max positional embedding. Closes: https://github.com/vllm-project/vllm-ascend/issues/1038 Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-17 23:14:25 +08:00
Jade Zheng	afc8edb046	[Bugfix]: Pass scaling args to mc2 (#1202 ) Pass `expert_scale` and `expand_scale` args to the dispatch and combine functions. Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2025-06-17 22:16:44 +08:00
Li Wang	f8029945c3	[Bugfix] Remove cuda related lines and add additional pip mirror (#1252 ) ### What this PR does / why we need it? - For npu environment, we should use `PYTORCH_NPU_ALLOC_CONF ` rather than `PYTORCH_CUDA_ALLOC_CONF` - Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-17 21:25:40 +08:00
zzzzwwjj	23ca68d0c8	[refactor] Refactoring AscendFusedMoE (#1229 ) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? This PR is used for resolved [issue 1147](https://github.com/vllm-project/vllm-ascend/issues/1147) 1. Move fused_moe code into one file `fused_moe.py`. 2. Integrate branch conditions into function `get_fused_moe_state`. <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? 1. This PR has removed the env `VLLM_ENABLE_MC2`, because I think this env is useless, we can make judgments based on the current scenario without this env, it will only increase complexity. 2. This PR has removed the env `USING_LCCL_COM`, because this env has already expired. 3. `additional_config.expert_tensor_parallel_size` has already expired, and now we also use parameter `enable_expert_parallel`, consistent with the vLLM. <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-17 17:49:03 +08:00
Yikun Jiang	05dec7eda9	[Doc] Refactor and init user story page (#1224 ) ### What this PR does / why we need it? This PR refactor the user stories page: - Move it to community - Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo, GPUStack, verl - Add a new page for LLaMA-Factory ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview locally Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-17 09:36:35 +08:00
Yikun Jiang	9d3cbc0953	[Doctest] add installation doctest (#1179 ) ### What this PR does / why we need it? Install doctest ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Related: https://github.com/vllm-project/vllm-ascend/pull/983 Co-authored-by: wangli <wangli858794774@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com>	2025-06-17 08:52:26 +08:00
Mengqing Cao	96fa7ff63b	[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235 ) ### What this PR does / why we need it? 1. Fix rank set in DP scenario. The new poc version of torch-npu support setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we could use the rank set in `DPEngineCoreProc` directly instead of calculating local rank across dp by hand in the patched `_init_data_parallel` Closes: https://github.com/vllm-project/vllm-ascend/issues/1170 2. Bump torch-npu version to 2.5.1.post1.dev20250528 Closes: https://github.com/vllm-project/vllm-ascend/pull/1242 Closes: https://github.com/vllm-project/vllm-ascend/issues/1232 ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-06-16 23:09:53 +08:00
zhuo97	f5404dc650	Fix the device error when using ray as vllm-acend backend (#884 ) 1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES 2. Add lazy init for vllm_ascend_C Signed-off-by: zhuo97 <1103045176@qq.com>	2025-06-16 21:03:16 +08:00
wangxiyuan	69b817ed65	[CI] Add unit test framework (#1201 ) This PR added the unit test framework to enable ut for vLLM Ascend. Unit test runs on CPU machines. It'll be ran once lint check is passed the same as e2e test. For unit test, this PR created a new folder called `ut` under `tests` module. All the test file in `ut` should keep the same with the code in `vllm-ascend`. The file name should be start with `test_` prefix. For example, in this PR. the `test_ascend_config.py` is added for `ascend_config.py` test. A new fille `worker/test_worker_v1.py` is also added as the placeholder. This file should be the unit test for `vllm-ascend/worker/worker_v1.py`. Additional, a new `fake_weight` folder is added, it contains the config.json from `facebook/opt-125m`, so that the test will not always visit huggingface. TODO: We should add all the unit test file one by one in the future. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-16 18:32:28 +08:00
Yikun Jiang	966557a2a3	[Build] Speedup image build (#1216 ) ### What this PR does / why we need it? 1. Rename workflow name to show OS info 2. Speedup image build: - PR: only arm64 build on openEuler arm64, only amd64 build on Ubuntu amd64 - Push/Tag: still keep origin logic use qemu on amd64 This PR actually drop the e2e image build per PR but I think it's fine consider it's stable enough, if we still meet some problem we can revert this PR 43-44mins ---> about 8-10 mins ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-16 09:02:53 +08:00
Yikun Jiang	4ce860a2be	[CI] Make e2e test to be preemptible and simple (#1217 ) ### What this PR does / why we need it? This PR make e2e test to be simple, even bring some repeat code between single card and multicard, but we will not struggle with across max-parallel, matrix and concurrency: 1. This PR make e2e test to be preemptible and simple: - lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel) - Anytime you push another PR will cancel previous job, whatever the job is lint / e2e / multi-cards 2. Use Modelscope rather than hf-mirror 3. Resolve some error like `Canceling since a higher priority waiting request for pr-XXXX-limit-npu-4 exists` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel) - e2e test will canceled by update patch --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-15 22:07:43 +08:00
ttanzhiqiang	4270682383	Waiting for BMM NZ support(Improve TPOP 2ms performance) (#1131 ) ### What this PR does / why we need it? W_UV/W_UK_T cannot be converted to nz, because this position will be fused into transposebatchmatmul, which does not support nz. The weights are actually converted back to nd in each run. ### Does this PR introduce _any_ user-facing change? Use #1098 as the baseline, p90 TPOT 90.79ms->88.58ms, improve TPOP 2ms ### How was this patch tested? use #1101 --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-06-15 19:57:02 +08:00
22dimensions	0d2074a1ec	[Doc] fix VLLM_USE_V1 value in graph mode docs (#1226 ) os.environ["VLLM_USE_V1"] must be assigned with str, not other type. ![image](https://github.com/user-attachments/assets/9d337ae5-00e5-4179-832e-c6c917dd5798) Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-15 15:41:11 +08:00
fems14	ab5d110fcc	vllm-ascend support chunked prefill (#1172 ) ### What this PR does / why we need it? vllm-ascend support chunked prefill for MLA --------- Signed-off-by: fems14 <1804143737@qq.com>	2025-06-14 22:31:16 +08:00
Mengqing Cao	a3b5af8307	[CI/UT][Graph] Add ut for torchair graph mode (#1103 ) ### What this PR does / why we need it? Add ut for torchair graph mode on DeepSeekV3 ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-06-14 16:59:00 +08:00
Yikun Jiang	94a52cf577	Add ShouJian Zheng (@jianzs) as vLLM Ascend maintainer (#1203 ) ### What this PR does / why we need it? Add @jianzs as vLLM Ascend maintainer @jianzs ---- I would like to nominate Shoujian Zheng (@jianzs <https://github.com/jianzs>) as a maintainer, starting with my +1. - He focuses on the code quality and good design with solid reviews in P/D disaggregation and DeepSeek improvement area about 30+ high quality review, such as #issuecomment-2811764833, #discussion_r2069927605 and #pullrequestreview-2820996674. This is the most important reason why I nominated him, because helping community developers complete PRs with high quality and continuously ensure the quality of codebase is one of the important responsibilities of a maintainer. We believe he is a great addition. - Shoujian's main expertise is distributed inference. He has a lot of experience in production about AI infra. He has very good habits and explains in great detail all changes #issue-3023082580 anqd share results open: #issuecomment-2853140443. And High quality PR: #706, #774, #852. - Community Involvement: Active involved in community discussion, he is collaborative and helps the users solve problems, involved in 30+ PR and issue, such as #issuecomment-2911934292 and #issuecomment-2833523571. Reference: [1] https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html [2] https://vllm-ascend.readthedocs.io/en/latest/community/governance.html Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-13 18:25:50 +08:00
whx	47b507b180	[CI] Recover ut for ascend scheduler only in ci of v1. (#1180 ) Last PR [#943 ](https://github.com/vllm-project/vllm-ascend/pull/943) wrongly open ut of AscendScheduler in V0 ci, this PR fixes this problem and only run ut of it in V1 ci. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-13 07:51:23 +08:00
sdmyzlp	e72f94e38f	Support multistream of MLA vector operations (#1135 ) ### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlaping being: ``` \| q_rmsnorm \| \| kv_norm_rope_cache \| \| q_rope \| \| matmul W_DQ \| matmul W_DKV \| index \| index \| matmul W_UQ \| split \| matmul W_KV_T \| ``` Currently, the `IndexByTensor` operators introduced by computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by later optimization (#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-12 21:42:09 +08:00
Wan_Danfeng	55c0e68883	[Doc] Add Referer header for CANN package download url. (#1192 ) ### What this PR does / why we need it? fix the CANN download url ### Does this PR introduce _any_ user-facing change? no, do not have any user-facing change ### How was this patch tested? run the wget command and cann package is rightly downloaded. --------- Signed-off-by: wan_danfeng <wonderful199082@126.com>	2025-06-12 21:22:23 +08:00
wangyanhui-cmss	c6e2a5fb40	[fix] fix bug in 1p1d disaggregated_prefill example (#1184 ) ### What this PR does / why we need it? fix bug in 1p1d disaggregated_prefill example ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested with python find_device_ips.py and run disaggregated_prefill example <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>	2025-06-12 19:40:58 +08:00
Li Wang	37f4469a03	[CI][Benchmark] Add qwen2.5-7b test (#1104 ) ### What this PR does / why we need it? - Add qwen2.5-7b performance benchmark, this is a sub pr of #1099, for v1 test, need more verify - Fix get commit time after checkout --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-12 10:47:30 +08:00
Li Wang	dd207cb261	[CI][Benchmark] Add new model and v1 test to perf benchmarks (#1099 ) ### What this PR does / why we need it? - Add qwen2.5-7b-instruct test - Add v1 test --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-12 10:46:41 +08:00
ttanzhiqiang	2498d297ae	add custom ascendc kernel vocabparallelembedding (#796 ) This PR add custom ascendc kernel vocabparallelembedding support in vllm-ascend, related CMakeLists and setuptools is also added in this PR. pytest -s benchmarks/ops/ben_vocabparallelembedding.py pytest -s tests/ops/test_vocabparallelembedding.py --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-06-12 10:44:33 +08:00
whx	3393d53b36	[Scheduler][MTP] Add support for speculative decoding in AsecendScheduler. (#943 ) This PR adds support for speculative decoding in AsecendScheduler. Also inculde part of support for disaggregated prefill, full support will be merged in follow-up PR. --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-11 20:55:44 +08:00
wangxiyuan	4f5964420e	[CI] Upgrade vllm to 0.9.1 (#1165 ) 1. upgrade vllm to 0.9.1. 0.9.0 is not supported for main branch now. keep doc to 0.9.0 until we release the first 0.9.1 release. 2. disable V0 test for PR 3. move actionlint check to lint job Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-11 16:33:11 +08:00
chenwaner	e46dc142bf	Enable kvcache_nz for the decode process in torchair graph mode (#1098 ) What this PR does / why we need it? Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA in long sequences. Does this PR introduce any user-facing change? If need to enable kvcache_nz, should set the additional_config.torchair_graph_config.enable_kv_nz=True How was this patch tested? 1. Tested in deepseek model: with batchsize 64 and seq_len 1k+3k, 61 layers FA total time improves 20.80ms -> 19.76ms 2. operator precision test: [aclnnFusedInferAttentionScoreV3_result.csv](https://github.com/user-attachments/files/20664138/aclnnFusedInferAttentionScoreV3_result.csv) 3. tpot test from @ttanzhiqiang, and curl one result is normal https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2948542159 https://github.com/vllm-project/vllm-ascend/pull/1098#issuecomment-2954496588 --------- Signed-off-by: chenwaner <861645847@qq.com>	2025-06-11 14:09:28 +08:00
yz	4153a5091b	[Doc] Fix the config parameter name "enable" in graph_mode.md. (#1159 ) Fix the doc typo in graph_mode.md Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>	2025-06-11 11:03:37 +08:00
ttanzhiqiang	980cd81466	etp best a2 (#1101 ) ### What this PR does / why we need it? Single machine 16 cards deepseekr1 attention (tp8/dp2) / moe(etp) Best performance rely on: vllm-ascend commit id:da9acfca6053352730fce75fb772e214755d0341 vllm commit id:b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc + https://github.com/vllm-project/vllm-ascend/pull/910 + [Reduce _npu_flash_attention mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100+ [Reduce memory usage by splitting tokens in fused_experts] --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-06-11 10:40:50 +08:00
depeng1994	860a5ef7fd	provide an e2e guide for execute duration profiling (#1113 ) ### What this PR does / why we need it? provide an e2e guide for execute duration profiling Signed-off-by: depeng1994 <depengzhang@foxmail.com>	2025-06-11 10:02:11 +08:00
sdmyzlp	7bdc606677	Support multistream of shared experts in FusedMoE (#997 ) Contains on #1111 for completeness. <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? Implement multi-stream parallelism for MoE layers with shared experts, where computation of shared experts will be overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, weights of shared experts will be force to replicate across all cards, regardless of any tensor parallelism configurations, to avoid AllReduce operations. With the expected overlaping being: ``` \| shared gate_up \| shared act \| \| shared down \| \| dispatch \| routed gate_up, act, down \| combine \| ``` <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? No. <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-11 09:18:38 +08:00
Mengqing Cao	04abfd8721	[CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass (#1163 ) [CI] Skip test_v1_spec_decode.py::test_ngram_correctness to make longterm CI pass Related: https://github.com/vllm-project/vllm-ascend/issues/1162 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-11 07:31:13 +08:00
22dimensions	8b48daaa44	[CI] rename Qwen2.5-0.5B-Instruct-W8A8 model (#1145 ) 1. rename vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8-new to vllm-ascend/Qwen2.5-0.5B-Instruct-W8A8 Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-06-11 06:18:32 +08:00
Mengqing Cao	8dd686dfa2	[MLA][Graph] Improve assertion on Graph mode with MLA (#933 ) ### What this PR does / why we need it? Improve assertion on Graph mode with MLA. When running deepseek with graph mode, the fused MLA op only support `numHeads / numKvHeads ∈ {32, 64, 128}`, thus we improve the assertion info here to avoid users confused with this. ### Does this PR introduce _any_ user-facing change? Adjusting tp size is required when running deepseek-v3/r1 with graph mode. deepseek-v2-lite is not supported in graph mode. ### How was this patch tested? Test locally as the CI machine could not run V3 due to the HBM limits. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-10 22:26:53 +08:00
Pleaplusone	291c216898	fix torchair execute issue on padding data, and mtp padding logic (#1160 ) ### What this PR does / why we need it? The former PR https://github.com/vllm-project/vllm-ascend/pull/736 select the valid token inside the `input_ids` and `position_ids` breaks the necessary padding required by torchair. In this PR, we pending the pad logic after the multimodal part. Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-06-10 22:20:40 +08:00

1 2 3 4 5 ...

404 Commits