xc-llm-ascend

Author	SHA1	Message	Date
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
li chaoran	ff97740b8d	Use mirror images (#1912 ) ### What this PR does / why we need it? More discussion can be found [here](https://github.com/ascend-gha-runners/docs/issues/23). The infra team deployed an internal registry because both `m.daocloud.io` and `quay.io` suffered from unstable connections. Switching CI to the internal registry improves both connection stability and download speed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested locally - vLLM version: v0.9.2 - vLLM main: `6b46c4b653` --------- Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>	2025-07-24 10:47:05 +08:00
li chaoran	3e39d7234c	[CI] Switching to infra cache server to reduce network pressure (#1792 ) ### What this PR does / why we need it? This PR introduces the infra cache server to speed up apt/pip package installation ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Tested locally; with this config, network bandwidth usage dropped from 100% to 5% when a new PR was submitted. <img width="807" height="334" alt="image" src="https://github.com/user-attachments/assets/16f03bce-4531-4c71-ab6e-8308dc2c022c" /> - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>	2025-07-18 18:39:25 +08:00
wangxiyuan	bf2549856f	[CI] Fix changes CI to recover codecov (#1799 ) Add the `checkout` action before `dorny/paths-filter` so that it works in the `push` case. It is a known issue that `dorny/paths-filter` works without `checkout` in the `pull_request` case but fails in the `push` case. More detail is here: https://github.com/dorny/paths-filter/issues/60#issuecomment-1464281021 The push CI works after this PR. The test result is here: https://github.com/wangxiyuan/vllm-ascend/actions/runs/16285606468/job/45983607539 - vLLM version: v0.9.2 - vLLM main: `d4d309409f` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 15:01:13 +08:00
wangxiyuan	787010a637	[Test] Remove VLLM_USE_V1 in example and tests (#1733 ) V1 is enabled by default, so there is no need to set it by hand now. This PR removes the redundant setting from the example and tests - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 12:49:57 +08:00
wangxiyuan	011fd73a48	[CI] Make CI tracker more clear (#1720 ) 1. enable lint check for all change 2. only run ut and e2e if it's the code change. 3. only run ut and disable e2e if the change is ut only. 4. disable wheel build for push case 5. run unit test when pr is merged 6. remove useless pytest.ini - vLLM version: v0.9.2 - vLLM main: `fdfd409f8f` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-10 16:03:23 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256 ) ### What this PR does / why we need it? Follow the vllm-project/vllm lint approach: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to avoid low-level errors as much as possible. This PR is one step of #1241, whose purpose is to make the linting system clearer and more convenient. This step mainly enables the following hooks: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format (checks csrc with Google style) needs code cleanup, disabled for now - pymarkdown needs code cleanup, disabled for now - shellcheck needs code cleanup, disabled for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with newly added/existing tests. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vllm has released 0.9.2. This PR drop 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
Yikun Jiang	e4e9ea02ab	Upgrade vLLM version to v0.9.2 (#1652 ) ### What this PR does / why we need it? This patch upgrades vLLM to v0.9.2. It does not remove the v0.9.1 compatibility code, to keep the review easy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `14601f5fba` - Accuracy test with 0.9.2: https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-08 14:18:17 +08:00
Mengqing Cao	f2a20393a2	[CI] Fix mypy check in CI (#1655 ) ### What this PR does / why we need it? Fix mypy check in CI: https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654 Mypy failed due to a newer numpy version. We need to pin `numpy=1.26.4` in vllm-ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-07 20:19:16 +08:00
Mengqing Cao	dd22ac38b2	[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136 ) ### What this PR does / why we need it? 1. run deepseek acc ut per pr --- multicard CI time increased by 9 min 2. run spec decode e2e test on v1 per pr --- singlecard CI time increased by 3 min (partly disabled because it does not work yet) ~~3. align the output of whether dbo is enabled or not~~ The generated results with and without dbo cannot be aligned. https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136 4. skip V0 mtp test due to failure in https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816 5. fix some version conflicts ### How was this patch tested? CI passed with newly added tests. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-04 18:05:45 +08:00
zhangxinyuehfad	4e910186de	[CI/UT] Unify model usage via ModelScope in CI (#1207 ) ### What this PR does / why we need it? Unify Model Usage via ModelScope ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-04 10:52:17 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. Besides, we add a supported model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Change as little existing code as possible to add support for the v1 pooling task. Note that I moved `vllm.v1.worker.gpu_input_batch` down into vllm-ascend to decouple it from the frequently changing upstream interfaces. ### How was this patch tested? CI passed with newly added/existing tests. I also first ran a simple local test adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown below: ```python import os import torch from vllm import LLM os.environ["VLLM_USE_MODELSCOPE"]="True" def get_detailed_instruct(task_description: str, query: str) -> str: return f'Instruct: {task_description}\nQuery:{query}' # Each query must come with a one-sentence instruction that describes the task task = 'Given a web search query, retrieve relevant passages that answer the query' queries = [ get_detailed_instruct(task, 'What is the capital of China?'), get_detailed_instruct(task, 'Explain gravity') ] # No need to add instruction for retrieval documents documents = [ "The capital of China is Beijing.", "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun." ] input_texts = queries + documents model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed") outputs = model.embed(input_texts) embeddings = torch.tensor([o.outputs.embedding for o in outputs]) scores = (embeddings[:2] @ embeddings[2:].T) print(scores.tolist()) # [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]] ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
sharonyunyun	941269a6c5	adjusting the communication method in graph mode (#1194 ) ### What this PR does / why we need it? Communication performance optimization: replace allreduce with reduce_scatter+all_gather in the MLA layer's TP group, to remove stridedsliced and all_gather in the MOE layer. When tp > 1, it is enabled during the decode phase of graph mode when enable_multistream_moe, MLA, use_v1, and MC2 are used. According to the end-to-end RL inference test results, this PR brings a 3% gain in the decode stage. Before Improvement Profiling kernel_details ![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003) Evaluation ![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7) ![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057) After Improvement Profiling kernel_details ![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e) Evaluation ![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0) ![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4) ### Does this PR introduce _any_ user-facing change? Users need to configure enable_multistream_moe=True ### How was this patch tested? Add e2e test cases to cover code logic Signed-off-by: sharonyunyun <zhangying134@huawei.com>	2025-06-25 19:56:49 +08:00
Li Wang	15df8be937	[Doc] Add sleep mode doc (#1295 ) ### What this PR does / why we need it? Add sleep related doc and example --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-25 14:07:14 +08:00
Mengqing Cao	52317f92cb	[DP] Tiny fix of dp and update example (#1273 ) ### What this PR does / why we need it? Add `max_num_tokens_across_dp` to AscendMetadata to fix dp. This PR fixes the bug introduced by https://github.com/vllm-project/vllm-ascend/pull/1229, which added an argument `max_num_tokens_across_dp` when dp_size > 1. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-25 11:03:04 +08:00
zxdukki	f04c6763d8	[Bugfix] fix env variable in dbo (#1284 ) ### What this PR does / why we need it? Fix env variable in dbo to enable dbo in DeepSeek-V3 model. Besides, we have fixed a known issue in deepseek-dbo. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? This patch can be tested with newly added e2e tests: [tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e). It can be verified with pytest. --------- Signed-off-by: zhuohuan <zxdu1997@gmail.com>	2025-06-23 09:07:57 +08:00
Shanshan Shen	21fb68a03a	[CI] Update guided decoding ut (#1312 ) ### What this PR does / why we need it? Update guided decoding ut. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-23 09:06:20 +08:00
Yikun Jiang	a95afc011e	[CI] Enable merge trigger unit test and accuracy test schedule job (#1345 ) ### What this PR does / why we need it? - Enable merge trigger unit test and accuracy test schedule job - Pin lm-eval==0.4.8 to resolve Qwen3 8B accuracy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-22 17:21:57 +08:00
Yikun Jiang	2009fdb8da	[Test] Enable code cov for V1 and enable push trigger (#1164 ) ### What this PR does / why we need it? - Enable code cov for V1 - Enable push triggered job ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-21 00:01:05 +08:00
Mengqing Cao	96fa7ff63b	[DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235 ) ### What this PR does / why we need it? 1. Fix rank set in DP scenario. The new poc version of torch-npu support setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we could use the rank set in `DPEngineCoreProc` directly instead of calculating local rank across dp by hand in the patched `_init_data_parallel` Closes: https://github.com/vllm-project/vllm-ascend/issues/1170 2. Bump torch-npu version to 2.5.1.post1.dev20250528 Closes: https://github.com/vllm-project/vllm-ascend/pull/1242 Closes: https://github.com/vllm-project/vllm-ascend/issues/1232 ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-06-16 23:09:53 +08:00
wangxiyuan	69b817ed65	[CI] Add unit test framework (#1201 ) This PR adds the unit test framework to enable ut for vLLM Ascend. Unit tests run on CPU machines. They run once the lint check passes, the same as the e2e tests. For unit tests, this PR creates a new folder called `ut` under the `tests` module. The layout of the test files in `ut` should mirror the code in `vllm-ascend`, and file names should start with the `test_` prefix. For example, in this PR, `test_ascend_config.py` is added to test `ascend_config.py`. A new file `worker/test_worker_v1.py` is also added as a placeholder; it should hold the unit tests for `vllm-ascend/worker/worker_v1.py`. Additionally, a new `fake_weight` folder is added. It contains the config.json from `facebook/opt-125m`, so that the tests do not always need to visit huggingface. TODO: We should add all the unit test files one by one in the future. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-16 18:32:28 +08:00
Yikun Jiang	4ce860a2be	[CI] Make e2e test to be preemptible and simple (#1217 ) ### What this PR does / why we need it? This PR makes the e2e test simple. It introduces some repeated code between single card and multicard, but we no longer struggle with max-parallel, matrix and concurrency: 1. This PR makes the e2e test preemptible and simple: - lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel) - Any new push to the PR cancels the previous job, whether that job is lint / e2e / multi-cards 2. Use Modelscope rather than hf-mirror 3. Resolve errors like `Canceling since a higher priority waiting request for pr-XXXX-limit-npu-4 exists` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - lint ---> e2e (2 parallel) ---> e2e multi-card (1 parallel) - e2e test is cancelled by a patch update --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-15 22:07:43 +08:00
whx	47b507b180	[CI] Recover ut for ascend scheduler only in ci of v1. (#1180 ) The last PR [#943 ](https://github.com/vllm-project/vllm-ascend/pull/943) wrongly enabled the ut of AscendScheduler in V0 CI. This PR fixes the problem and runs it only in V1 CI. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-13 07:51:23 +08:00
whx	3393d53b36	[Scheduler][MTP] Add support for speculative decoding in AscendScheduler. (#943 ) This PR adds support for speculative decoding in AscendScheduler. It also includes part of the support for disaggregated prefill; full support will be merged in a follow-up PR. --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-11 20:55:44 +08:00
wangxiyuan	4f5964420e	[CI] Upgrade vllm to 0.9.1 (#1165 ) 1. upgrade vllm to 0.9.1. 0.9.0 is not supported for main branch now. keep doc to 0.9.0 until we release the first 0.9.1 release. 2. disable V0 test for PR 3. move actionlint check to lint job Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-11 16:33:11 +08:00
sdmyzlp	7bdc606677	Support multistream of shared experts in FusedMoE (#997 ) Contains on #1111 for completeness. ### What this PR does / why we need it? Implement multi-stream parallelism for MoE layers with shared experts, where computation of shared experts will be overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, weights of shared experts will be forced to replicate across all cards, regardless of any tensor parallelism configurations, to avoid AllReduce operations. With the expected overlapping being: ``` \| shared gate_up \| shared act \| \| shared down \| \| dispatch \| routed gate_up, act, down \| combine \| ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-11 09:18:38 +08:00
wangxiyuan	95414bae70	[CI] Run e2e after pre check pass (#1132 ) Make sure the lint test passes before starting the e2e test to save compute resources. Updated the patch doc to make sure the CI works as expected. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-10 17:18:09 +08:00
Yikun Jiang	8d00775fce	[SpecDecode][CI] Set default values to fix spec decode and fix multicard CI (#1109 ) ### What this PR does / why we need it? - Set default values to fix spec decode - To avoid oom, we need to run the test in a single process ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed, especially multicards CI - For spec decode test, long term CI passed Closes: https://github.com/vllm-project/vllm-ascend/pull/1105 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: mengwei805 <mengwei25@huawei.com>	2025-06-07 11:23:30 +08:00
Li Wang	a2552e10e4	[Worker][V1] Support sleep mode for v1 (#1084 ) ### What this PR does / why we need it? Support sleep mode for v1 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-06 21:54:02 +08:00
Li Wang	11a7df4270	[ModelRunner] Support embedding inputs (#916 ) ### What this PR does / why we need it? - Adds support for passing prompt_embeds to LLM.generate as ```python llm.generate({"prompt_embeds": input_embeds}, sampling_params) ``` or ```python llm.generate( [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params ) ``` - Add `prompt_embeds` to examples ### How was this patch tested? CI passed with newly added/existing tests. I also tested with the example script in this PR, and the output looks good: ```bash [Single Inference Output] ------------------------------ The capital of France is Paris. Paris is the largest city in France and is ------------------------------ Adding requests: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 3/3 [00:00<00:00, 3966.87it/s] Processed prompts: 100%\|█████████████████████████████████████████████████████████████████████████\| 3/3 [00:00<00:00, 3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s] [Batch Inference Outputs] ------------------------------ Q1: Please tell me about the capital of France. A1: The capital of France is Paris. It is located in the northern part of the Q2: When is the day longest during the year? A2: The day is longest during the year at the summer solstice. This typically occurs Q3: Where is bigger, the moon or the sun? A3: The sun is significantly bigger than the moon. The sun has a diameter of ------------------------------ ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-06 20:21:13 +08:00
wangxiyuan	e1ab6d318e	[Misc] Refactor additional_config (#1029 ) More and more config options are being added to additional_config. This PR provides a new AscendConfig to manage these options more easily and make the code cleaner and more readable. This PR also adds the `additional_config` doc for users, and adds test_ascend_config.py to make sure the new AscendConfig works as expected. TODO: Add e2e test with torchair and deepseek once the CI resource is available. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-05 16:28:01 +08:00
Li Wang	517811449e	[CI] Re-enable sleep mode test and skip failure breaking CI (#990 ) ### What this PR does / why we need it? - Re-enable sleep mode test - Fix nightly performance benchmark workflow - Fix model-runner-v1 bug for upstream [change](https://github.com/vllm-project/vllm/pull/18654) --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-04 16:24:16 +08:00
Yikun Jiang	92bc5576d8	Skip benchmarks/ in vllm ascend test (#1041 ) ### What this PR does / why we need it? Skip benchmarks/ in vllm ascend test to reduce CI cost ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-01 19:01:26 +08:00
NeverRaR	507ae627ca	feat: support compile torchair graph while warming up (#839 ) ### What this PR does / why we need it? feat: support compile torchair graph while warming up Signed-off-by: boying <897013703@qq.com>	2025-05-31 06:03:03 +08:00
wangxiyuan	f6e5decc10	[CI] upgrade to vllm 0.9.0 (#959 ) Upgrade to vllm 0.9.0. 0.8.5 will not be supported any more. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 21:18:41 +08:00
wangxiyuan	e2a0c19cea	[CI] Refactor CI (#952 ) 1. remove some useless test functions and files 2. fix the format.sh problem 3. enable full test for singlecard and multicard 4. move long term tests to the long_term folder. These tests run only when labeled and in the daily test. They include: spec decode, accuracy test ## After refactor: There are 4 test modules - `singlecard`: contains the tests running on one NPU. It runs for each PR and in the daily test. - `multicard`: contains the tests running on multiple NPUs. It runs for each PR and in the daily test. - `long_term`: contains the tests that take a long time (now including the `spec decode` and `accuracy` tests). It runs for PRs labeled `long-term-test` and in the daily test. - `e2e`: contains the tests for doc and pd feature. It runs for PRs labeled `pd-test` and in the daily test. ## Todo: 1. some tests are skipped; they should be fixed and re-enabled in the future. 2. the pyhccl test for multicard doesn't work at all. It should be enabled as well. 3. ensure long-term-test passes in the daily test. ### Known issue Currently, the `ready` label is required to start the pd test or long term test. And when `long-term-test` or `pd-test` is added after the other, the previously labeled test is re-run. So labeled tests should be run with the following steps: 1. decide which tests need to run, then add the label: `long-term-test` or `pd-test` or both. 2. add the `ready-for-test` label, then the tests will run. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 06:31:35 +08:00
jiangpeng	df58fb80ee	Spec decode support for V1 Engine (#874 ) ### What this PR does / why we need it? Add spec decode support for the V1 Engine - Currently, Ascend does not support Triton kernels, so the `rejection_sampler.py` Triton kernel is rewritten in PyTorch. However, the PyTorch version is not as good as the Triton one, so Ascend C will be used to implement the function in the future. - Currently, spec decode supports only the ngram algorithm. The eagle algorithm needs to be further adapted. ### Does this PR introduce _any_ user-facing change? No user-facing change. ### How was this patch tested? Tested by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and `tests/sample/test_rejection_sampler.py`, which cover the basic function of the rejection sampler and the e2e function of spec decode. Signed-off-by: ponix-j <657511300@qq.com>	2025-05-23 14:25:46 +08:00
yupeng	0f53b138f6	[V1][LoRA][Test] V1 Engine LoRA support & e2e test (#893 ) ### What this PR does / why we need it? Add V1Engine LoRA support. Add LoRA e2e test on single card and multiple cards. ### Does this PR introduce _any_ user-facing change? support lora for V1 ### How was this patch tested? CI passed with new added test --------- Signed-off-by: jesse <szxfml@gmail.com> Signed-off-by: paulyu <paulyu0307@gmail.com> Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: jesse <szxfml@gmail.com> Co-authored-by: paulyu <paulyu0307@gmail.com>	2025-05-22 19:20:51 +08:00
Li Wang	8e4e791fcd	[CI] Add deepseek-v2-lite test (#631 ) ### What this PR does / why we need it? Add deepseek-v2-lite test, part of #499 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-12 14:59:17 +08:00
wemaster	19c8e134e4	[CI/UT] fix spec ut in vllm-ascend main and vllm main (#759 ) ### What this PR does / why we need it? #### 1. fix spec ut in vllm-ascend main and vllm main As https://github.com/vllm-project/vllm-ascend/pull/694 and https://github.com/vllm-project/vllm-ascend/pull/749 verified, spec UT passes with vllm-ascend main and vllm 0.8.5, but CI fails with vllm-ascend main and vllm main. I found that the cause is a triton bug https://github.com/triton-lang/triton/issues/2266, but I have not figured out why the bug does not affect vllm-ascend main with vllm 0.8.5; maybe the usage of triton changed between vllm 0.8.5 and the latest main. As the bug report describes, I changed the minimum block_size in the UT from 8 to 16, and verified locally that the change is effective. #### 2. modify the skip form of some cases. I changed some commented-out cases to the skipif form, which is more standard. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-05-10 09:45:56 +08:00
Li Wang	58d2f85c4a	[CI] Fix schedule trigger bug (#757 ) ### What this PR does / why we need it? This PR aims to fix nightly ci [broken](https://github.com/vllm-project/vllm-ascend/actions/runs/14848150987) We have a workflow containing multiple triggers: - push events (to the default branch) - pull requests (against the default branch) - scheduled events Our paths-filter action works great for the first two use-cases, detecting the context and base to compare against. However, it fails for scheduled events giving the error `This action requires 'base' input to be configured or 'repository.default_branch' to be set in the event payload.` For the scheduling trigger event, we choose to skip this filter because we don't need its results: ``` - name: Check for changes in Speculative Decode if: github.event_name != 'schedule' ``` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-10 09:45:07 +08:00
Yikun Jiang	5897dc5bbe	[Build] Bump vLLM version to v0.8.5.post1 (#755 ) ### What this PR does / why we need it? Bump vllm version to v0.8.5.post1 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-06 11:44:12 +08:00
Yikun Jiang	79538b5d73	Upgrade CANN version to 8.1.rc1 (#747 ) ### What this PR does / why we need it? Make CANN version bump separately from https://github.com/vllm-project/vllm-ascend/pull/708 - Upgrade CANN version to 8.1.rc1 - Add prefix to speed up download `m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10` - Fix trailing space in Dockerfile.openEuler - Add note for `/workspace` and `/vllm-workspace` as followup of https://github.com/vllm-project/vllm-ascend/pull/741 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed Co-authored-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-05-06 05:44:18 +08:00
Yikun Jiang	d2ead057ae	Re-enable Speculative Decode test for vLLM v0.8.5 (#749 ) ### What this PR does / why we need it? Re-enable Speculative Decode test for vLLM v0.8.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-02 14:44:48 +08:00
whx	8b194ad12e	[Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694 ) ### What this PR does / why we need it? - This PR proposes a P2P version of Disaggregated Prefill based on llm_datadist which manages data transfer. - This solution rebuilds the previous offline single-node Disaggregated Prefill solution, and now supports multi-node and online serving. - Currently this solution supports the 1P1D situation of Deepseek hybrid parallelism (P: TP+EP, D: DP+EP). Note that the xPyD situation is considered in the solution design, and will be supported soon within the v1 engine. --------- Signed-off-by: hw_whx <wanghexiang7@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: ganyi <pleaplusone.gy@gmail.com>	2025-05-01 22:31:36 +08:00
wangxiyuan	f8350569e6	[CI] upgrade vllm to 0.8.5 (#715 ) 1. Upgrade vllm to 0.8.5 2. Drop 0.8.4 support 3. Keep doc to 0.8.4rc2 until we release 0.8.5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-30 09:15:50 +08:00
Li Wang	7aee9228f0	[CI] Add nightly CI (#668 ) ### What this PR does / why we need it? Add nightly CI for basic function and model usability --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-29 16:35:52 +08:00
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) ### What this PR does / why we need it? As the custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, I follow it to change the custom deepseek_mtp modeling. Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429, so I add them now. To better take care of the MTP feature in the vllm-ascend repository, I added cases related to graph mode (torchair), but I skip them since torchair cannot correctly clean up memory in vllmrunner. I also added some cases for MTP quantization weights, but the test weights are not ready, so I skip them and will enable them when the test quant weights are ready. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change (https://github.com/vllm-project/vllm-ascend/issues/660) issue, so I added the relevant changes. ### Does this PR introduce _any_ user-facing change? Now you can use the following method to use mtp in deepseek v3/r1 float or quant weights with eager mode. ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, enforce_eager=True, trust_remote_code=True, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` or use mtp in deepseek v3/r1 float or quant weights with graph mode (torchair) ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, trust_remote_code=True, additional_config={ 'enable_graph_mode': True, }, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` Notes: 1. now, we support k>1, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1, we will support it when vLLM does it in https://github.com/vllm-project/vllm/issues/13500. 3. if MTP fails with `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236 file `vllm_ascend/patch/patch_metrics.py` method `__npu_async_metrics_collector_init__` ### How was this patch tested? Tested locally and by CI Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
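
Note on commit 941269a6c5 above: that change replaces allreduce with reduce_scatter+all_gather in the TP group. The two are numerically equivalent, which can be sketched with a toy single-process simulation. This is not vllm-ascend code; the function names below are hypothetical, each inner list stands in for one rank's tensor, and no real collective library is involved.

```python
# Toy simulation (assumption: plain Python lists stand in for per-rank tensors).

def all_reduce(shards):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*shards)]
    return [list(total) for _ in shards]


def reduce_scatter(shards):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(shards)
    length = len(shards[0])
    assert length % world == 0, "vector length must divide evenly across ranks"
    chunk = length // world
    out = []
    for rank in range(world):
        lo, hi = rank * chunk, (rank + 1) * chunk
        out.append([sum(s[i] for s in shards) for i in range(lo, hi)])
    return out


def all_gather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
assert all_gather(reduce_scatter(ranks)) == all_reduce(ranks)
```

Because the reduce_scatter output is already split per rank, a layer that only needs its own slice can consume it directly and defer the all_gather, which is presumably where the saving described in that commit comes from.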


80 Commits