1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
python module
2. Fix a model runner bug to stay consistent with vllm
3. Add release note for 0.9.0rc2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Make accuracy CI and report work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually reviewed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix incompatibility problem for non-EPLB scenarios in #1116
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with online serving and e2e CI.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
1. Update 0.9.0rc1 release date
2. Update feature and model support list
3. Add DP known issue to release note
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add EPLB expert map import capabilities
### Does this PR introduce _any_ user-facing change?
When importing an EPLB expert map, you need to pass the expert map file
via the vLLM `additional_config` argument.
### How was this patch tested?
1. You need to collect expert hotness and generate an expert placement
file based on the hotness and the EPLB algorithm, or you can directly
use an existing expert placement table.
2. When launching vLLM, enable EC2 and pass the configuration via the
command-line argument:
`--additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'`
Co-authored-by: songshanhu07 <1763685535@qq.com>
---------
Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: songshanhu07 <1763685535@qq.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Remove the `spec_decode.metrics` patch as this has been resolved in
https://github.com/vllm-project/vllm/pull/16983 (included in vllm
`v0.9.0`).
Before: "Returns a CUDA event recording when the copy is complete."
After: "Returns a device event (NPU Event for vllm-ascend) recording when the copy is complete."
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
With this PR, we can migrate to the native `data_parallel.py` in vllm
examples and remove the version in vllm-ascend.
At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable
difficulties; therefore, we must employ a temporary workaround and
manually specify the device.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Set the `ACL_OP_INIT_MODE` env var default to `0`, since vllm-ascend may
have problems in some scenarios when it is set to `1`.
Plus, the guide https://github.com/vllm-project/vllm-ascend/issues/734
has also been updated.
Signed-off-by: shen-shanshan <467638484@qq.com>
Add unpadded Qwen2.5-VL for the verl scenario.
When using vllm-ascend in the verl scenario, set `USE_OPTIMIZED_QWEN2_5_VL`
(default `1`) to `0` to use the unpadded Qwen2.5-VL and avoid errors.
This is cherry-picked from 0.7.3-dev
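A minimal sketch of the switch described above, assuming the env var must be set before vllm/vllm-ascend is imported; the model name is a placeholder.

```python
# Hedged sketch: opt out of the optimized (padded) Qwen2.5-VL for verl.
import os
os.environ["USE_OPTIMIZED_QWEN2_5_VL"] = "0"  # default is "1"

from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder model name
```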
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
vllm-ascend does not support multimodal models on vLLM v1 yet, so this PR
changes `model_runner_v1.py` to use the MRoPE feature (among other changes)
to add that support. It is still not perfect: the Ascend operator does not
support `window/full attn` to reduce Memcpy operations, so it would run out
of memory if the input embedding is too large. For that reason we can't use
`self._profile_multimodal()` for profiling, since it uses a big dummy
multimodal input (i.e. images).
Fixes: https://github.com/vllm-project/vllm-ascend/issues/514
### Does this PR introduce _any_ user-facing change?
No, this feature does not require any user-facing change.
### How was this patch tested?
I tested this offline on my 910B3 machine with my own fork, and it works
well.
---------
Signed-off-by: cty <ctynb@qq.com>
### What this PR does / why we need it?
Based on the dual-batch overlap design proposed by the DeepSeek team and
the fused MoE implementation in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on
Ascend NPU. We split the input batch of the model into two microbatches
and then overlap the computation/communication ops in the attention and
MoE layers using two streams to improve performance. Our approach can be
easily extended when adding dispatch/combine communications for the MoE
layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops. In
our opinion, this is beneficial for arranging the order in which different
ops execute and thus avoiding contention for computation/communication
resources.
ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce _any_ user-facing change?
Adds an env variable `VLLM_ASCEND_ENABLE_DBO`. Users can enable DBO by
setting `VLLM_ASCEND_ENABLE_DBO=1`.
See /examples/offline_dualbatch_overlap_npu.py for more info.
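A minimal sketch of enabling the feature, assuming the env var is read at engine start-up; the model and parallel settings are placeholders (the shipped example in /examples/offline_dualbatch_overlap_npu.py is the authoritative reference).

```python
# Hedged sketch: turn on dual-batch overlap before constructing the engine.
import os
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=16)  # placeholder settings
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
```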
### How was this patch tested?
This patch can be tested with vllm 0.9.0 by running its online service
with benchmark tests. We have decoupled the DBO functionality from vllm,
so it should run without any modification to vllm's code (though some
modifications would be better implemented in vllm).
Any advice/discussion is welcome.
### Performance Benchmark
We ran vllm's benchmark_serving script to test the performance with
dual-batch overlap enabled.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model=DeepSeek-R1-W8A8 \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  -tp=16 \
  --port 8006 \
  --max-num-seqs 390 \
  --max-model-len 32768 \
  --max-num-batched-tokens 65536 \
  --block-size 128 \
  --compilation_config 0 \
  --gpu-memory-utilization 0.90 \
  --disable-log-requests \
  --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'
```
and ran the benchmark with the following parameters:
```bash
--dataset-name random --random-input-len 4096 --random-output-len 1 \
--num-prompts 200 --max-concurrency 8 --request-rate 5 \
--metric-percentiles 90
```
1. Tested the version using allgather+allreduce on Ascend 910B (tp16
ep16 + DeepSeek R1 W8A8).
2. Tested the version using alltoall:
   - prefill QPS: 0.90 -> 1.01
   - Mean TTFT: 8226 ms -> 7432 ms
The overlap approach with alltoall communication can be further
optimized by overlapping micro-batch 1's MoE computation with
micro-batch 2's dispatch all-to-all communication.
---------
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
### What this PR does / why we need it?
View optimization in torchair (enabled by default for a Transpose with
any axis of size 1) prevents the weight Transpose from being fused with
the subsequent GroupedMatmul, which decreases the performance of the MoE
layer when expert parallelism equals the total number of experts
(e.g. EP256 for DSKv3).
Add an option to solve this problem by disabling the optimization.
### Does this PR introduce _any_ user-facing change?
Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`,
which defaults to `True`.
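A hedged sketch of turning the optimization off, assuming the nested `torchair_graph_config` dict is accepted through `additional_config` as described above; the model name is a placeholder.

```python
# Hedged sketch: disable the torchair view optimization via additional_config.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model
    additional_config={
        "torchair_graph_config": {"enable_view_optimize": False},
    },
)
```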
### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid OOM, we need to run the test in a single process
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- CI passed, especially multicards CI
- For the spec decode test, long term CI passed
Closes: https://github.com/vllm-project/vllm-ascend/pull/1105
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
[CI] MoE alltoall communication optimization
The DeepSeek V3/R1 model has 256 routed experts. During parallel
inference, if the load on one EP rank is high, it slows down the overall
communication and computation time, so the unevenly distributed load
becomes a weakness of parallel inference. The data volume in the prefill
phase is large, and the inter-card communication/computation time is
closely tied to that data volume, so a small non-linear precision loss
can be traded for a near-linear performance improvement.
During parallel inference, communication implies global synchronization:
cards with low load finish their computation first and then wait for the
card with the highest load to finish. Therefore, if the load is
unbalanced, the most loaded card slows down the overall time. Significant
performance gains can be achieved by discarding a small number of tokens,
which is unacceptable in some precision-sensitive scenarios; however,
similar to quantization, this is a technique that accepts a tolerable
precision loss in some scenarios in exchange for performance. In
addition, the trade-off between performance and precision can be tuned by
configuring the proportion of discarded tokens.
We tested on A3 with batch size 8 (B), prompt length 3.5K tokens (S), and
the following parallel configuration: AttnDP=2, AttnTP=8, MoeTP=1, and
MoeEP=16. In this scenario, we got a 10%-15% performance gain.
Plus, in the next version we'll have an alltoallv MoE.
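To make the token-dropping idea concrete, here is a minimal, framework-agnostic sketch of capacity-based dropping before dispatch; it is illustrative only (not the vllm-ascend kernel), and the `capacity_factor` name and default are assumptions.

```python
# Hedged sketch: per-expert capacity truncation before all-to-all dispatch.
# Token assignments routed beyond an expert's capacity are dropped so the
# hottest EP rank no longer dominates the step time.
import torch

def keep_mask_after_capacity_drop(expert_ids: torch.Tensor, num_experts: int,
                                  capacity_factor: float = 1.2) -> torch.Tensor:
    """expert_ids: 1-D tensor with the expert chosen for each token assignment."""
    num_assignments = expert_ids.numel()
    capacity = max(1, int(capacity_factor * num_assignments / num_experts))
    keep = torch.zeros(num_assignments, dtype=torch.bool)
    for e in range(num_experts):
        slots = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True  # assignments beyond capacity are dropped
    return keep

# Example: 16 assignments over 4 experts with a skewed (hot expert 0) distribution.
ids = torch.tensor([0] * 10 + [1, 1, 2, 2, 3, 3])
print(keep_mask_after_capacity_drop(ids, num_experts=4).sum())  # -> 10 kept
```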
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
When profiling, it is often necessary to disable the call stack to
reduce profiling overhead, and adjust the profiler_level to level1 to
obtain more detailed operator and communication information.
Therefore, it is recommended to modify the default profiling
configuration.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
Fix the bug in torch 2.5.1 that raises a segmentation fault when
`pin_memory` is enabled while creating a tensor using `torch.tensor`.
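An illustrative sketch of the failure mode and one possible workaround pattern (pin after creation); the workaround shape is an assumption rather than necessarily the fix this PR applies, and it requires an accelerator backend to be present.

```python
# Hedged sketch of the issue and a workaround (torch 2.5.1).
import torch

data = list(range(8))
# x = torch.tensor(data, pin_memory=True)  # can segfault on torch 2.5.1 in this scenario
x = torch.tensor(data).pin_memory()        # workaround sketch: create first, then pin
```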
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
[{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples
### How was this patch tested?
CI passed with new added/existing test.
I have also tested with the example script in this PR, and the output
looks good:
```bash
[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]
[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the
Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs
Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon.
The sun has a diameter of
------------------------------
```
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add `with_prefill_across_dp` to AscendMetadata to fix DP.
This PR fixes the bug introduced by #1012, which added the arg
`with_prefill_across_dp` when dp_size > 1.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove the cumsum operator in MoE to improve performance.
### How was this patch tested?
It should be tested with the mc2 operator and graph mode enabled.
Signed-off-by: zhky <hahazhky@163.com>
Co-authored-by: 洪炜杰 <hongweijie1@huawei.com>
Fix the ascend config check logic:
1. Refactor `check_ascend_config` to make it clearer:
    - torchair graph mode should not work with `enforce_eager=True`
    - aclgraph should not work with torchair graph mode
2. Add config refresh for the RLHF case.
3. Fix a typo in the model runner.
4. Change the `expert_tensor_parallel_size` default to `0` to keep the
same behavior as before.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
The KV cache manager has been changed by
f8a1a2d108
This PR adapts the change in vllm-ascend to make CI happy.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.
**The model_runner_v1.py file only changed 5 lines, all of which are
`ProfileExecuteDuration()` calls; nothing else was changed, although the
diff shows more changes due to an alignment issue.**
### Does this PR introduce _any_ user-facing change?
Use the env var `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
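A hypothetical usage sketch built from the names mentioned above (`ProfileExecuteDuration`, `pop_captured_sync`, `VLLM_MODEL_EXECUTE_TIME_OBSERVE`); the import path and the capture context manager are assumptions rather than the verified API.

```python
# Hedged sketch: mark a stage asynchronously, then synchronize and print it.
import os
os.environ["VLLM_MODEL_EXECUTE_TIME_OBSERVE"] = "1"

import torch
from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path

profiler = ProfileExecuteDuration()
with profiler.capture_async("forward"):                 # assumed capture API
    y = torch.randn(512, 512) @ torch.randn(512, 512)   # stand-in for a model forward
profiler.pop_captured_sync()                            # block and print observed stages
```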
### How was this patch tested?
Tested with the deepseek model. It prints like this:
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
---------
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
### What this PR does / why we need it?
Support multi-stream execution inside MoE for DeepSeek.
This feature requires graph mode with mc2 enabled.
---------
Signed-off-by: David9857 <985700846@qq.com>
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.
### Does this PR introduce _any_ user-facing change?
Added the VLLM_ENABLE_TOPK_OPTIMZE config to the sampler
### How was this patch tested?
pytest test_sampler.py
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
More and more config options are being added to additional_config. This
PR provides a new AscendConfig to manage these options in an easier way,
making the code cleaner and more readable.
This PR also adds the `additional_config` doc for users.
Added test_ascend_config.py to make sure the new AscendConfig works as
expected.
TODO: Add e2e test with torchair and deepseek once the CI resource is
available.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix benchmark results path
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This is for the benchmark iteration, which changes the benchmark scripts
while checking out each commit, so we need to ensure the benchmark
scripts are always available.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Adjust the concurrency group for each npu workflow:
- pd and benchmarks share static-08-01, so only one of those jobs can run
at a time
- for other jobs, each PR/schedule should have only one running job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
- Remove workflow_dispatch
- Change schedule time to 2:00 UTC+8
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
### What this PR does / why we need it?
Update escli-tool to v0.2.1 to fix a dependency bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangli <858794774@qq.com>
### What this PR does / why we need it?
This is a post patch of #1014, with some convenience optimizations:
- Set a cached dataset path for speed
- Use PyPI to install escli-tool
- Add a benchmark results conversion script to produce developer-friendly
results
- Patch `benchmark_dataset.py` to disable streaming dataset loading over
the internet
- Add more trigger modes for different purposes: `pr` for debugging,
`schedule` for the daily test, `dispatch` and `pr-labeled` for manual
testing of a single (current) commit
- Disable the latency test for `qwen-2.5-vl` (this script does not
support multi-modal yet)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Add a bot to label merge conflicts; it helps developers and maintainers
keep code review and updates clear.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>