xc-llm-ascend

Author	SHA1	Message	Date
Li Wang	a2552e10e4	[Worker][V1] Support sleep mode for v1 (#1084 ) ### What this PR does / why we need it? Support sleep mode for v1 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-06 21:54:02 +08:00
wangxiyuan	0395ab30be	[Doc] Add graph mode user doc (#1083 ) Add graph mode user guide doc. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 21:14:34 +08:00
ApsarasX	9a4eb94ca9	[Misc] Adjust the default profiler configuration (#1097 ) ### What this PR does / why we need it? When profiling, it is often necessary to disable the call stack to reduce profiling overhead, and adjust the profiler_level to level1 to obtain more detailed operator and communication information. Therefore, it is recommended to modify the default profiling configuration. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-06-06 20:25:59 +08:00
Shanshan Shen	5d0e9fd19a	[Misc] Add `ACL_OP_INIT_MODE` env var and set default to `1` (#597 ) ### What this PR does / why we need it? Fix a bug in torch 2.5.1 that raises a segmentation fault when `pin_memory` is enabled while creating a tensor with `torch.tensor`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-06 20:22:51 +08:00
Li Wang	11a7df4270	[ModelRunner] Support embedding inputs (#916 ) ### What this PR does / why we need it? - Adds support for passing prompt_embeds to LLM.generate as ```python llm.generate({"prompt_embeds": input_embeds}, sampling_params) ``` or ```python llm.generate( [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params ) ``` - Add `prompt_embeds` to examples ### How was this patch tested? CI passed with newly added and existing tests. I also tested with the example script in this PR, and the output looks good: ``` [Single Inference Output] ------------------------------ The capital of France is Paris. Paris is the largest city in France and is ------------------------------ Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s] Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s] [Batch Inference Outputs] ------------------------------ Q1: Please tell me about the capital of France. A1: The capital of France is Paris. It is located in the northern part of the Q2: When is the day longest during the year? A2: The day is longest during the year at the summer solstice. This typically occurs Q3: Where is bigger, the moon or the sun? A3: The sun is significantly bigger than the moon. The sun has a diameter of ------------------------------ ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-06 20:21:13 +08:00
NeverRaR	c7f1c59911	feat: support compile multiple batch graph (#1085 ) ### What this PR does / why we need it? Support compiling graphs for multiple batch sizes, each with a different code object, to avoid cache invalidation ### How was this patch tested? ``` export VLLM_ENABLE_MC2=0 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=8 \ -dp=2 \ --no-enforce-eager \ --max-num-seqs 24 \ --max-model-len 32768 \ --max-num-batched-tokens 32768 \ --block-size 128 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_config": {"enabled": true,"use_cached_graph": true,"graph_batch_sizes": [8,16,24]},"ascend_scheduler_config": {"enabled":true,"chunked_prefill_enabled":false},"expert_tensor_parallel_size":16}' \ --gpu-memory-utilization 0.95 &> run.log & disown ``` Signed-off-by: boying <897013703@qq.com>	2025-06-06 20:17:51 +08:00
Mengqing Cao	c46632439a	[Bugfix][DP] Add with_prefill_across_dp to AscendMetadata to fix dp (#1094 ) ### What this PR does / why we need it? Add `with_prefill_across_dp` to AscendMetadata to fix DP. This PR fixes the bug introduced by #1012, which added an argument `with_prefill_across_dp` when dp_size > 1. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-06 19:20:33 +08:00
hahazhky	0b12c2acf7	[Kernel] Remove cumsum in groupedmatmul (#987 ) ### What this PR does / why we need it remove cumsum operator in MOE to improve performance ### How was this patch tested? it should be tested on a case with mc2 operator and graph mode enabled Signed-off-by: zhky <hahazhky@163.com> Co-authored-by: 洪炜杰 <hongweijie1@huawei.com>	2025-06-06 19:17:27 +08:00
wangxiyuan	dab19d5dca	[BugFix] Fix ascend config check (#1092 ) Fix the ascend config check logic: 1. refactor check_ascend_config to make it clear: 1. torchair graph should not work with enforce_eager=True 2. aclgraph should not work with torchair graph 3. add refresh config for rlhf case 4. fix a typo in model runner 5. change expert_tensor_parallel_size default to 0 to keep the same as before Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 18:54:37 +08:00
wangxiyuan	973f993a13	[Misc] fix initialize_kv_cache (#1102 ) The KV cache manager was changed by `f8a1a2d108`. This PR adapts vllm-ascend to that change to keep CI passing. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 16:46:23 +08:00
wangxiyuan	c94afd79ce	[Doc] Update the description for env (#1079 ) Add the description for env to make it more clear for users Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-06 09:48:43 +08:00
depeng1994	6b094a2bd4	[ModelRunner]Add profile execute duration observation (#1013 ) ### What this PR does / why we need it? We need to observe the time consumed in each stage of inference (including pre-processing, model forward, etc.), without any performance loss. Therefore, we use the event timestamp mechanism of the NPU to mark any stage during the execution of the NPU device (this marking operation is executed asynchronously, with no performance loss). Additionally, we provide a blocking synchronization API `pop_captured_sync` to be called at an appropriate time, to print the time consumed in all observed stages. Only 5 lines of model_runner_v1.py changed, all of which are `ProfileExecuteDuration()` calls; the diff shows more changes only because of alignment. ### Does this PR introduce _any_ user-facing change? Use the env var `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature ### How was this patch tested? Tested with a DeepSeek model; the output looks like this: ``` 5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms 5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms 5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms 5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms 5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms 5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms 5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms 5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms 5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms 5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms 5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms 5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms 5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms 5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms 5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms 5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms 5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms 5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms ``` --------- Signed-off-by: depeng1994 <depengzhang@foxmail.com>	2025-06-06 09:29:34 +08:00
David9857	78431b3469	[perf]Support MOE Multi-stream in Deepseek (#947 ) ### What this PR does / why we need it? Support MOE inner Multi-stream for Deepseek. This feature requires graph mode with mc2 enabled. --------- Signed-off-by: David9857 <985700846@qq.com>	2025-06-05 23:39:38 +08:00
sherie	908a851a77	optimize the function of computing topk and topp in sampler. (#970 ) ### What this PR does / why we need it? Optimize the performance of calculation logic in sampler and deepseekv2. ### Does this PR introduce _any_ user-facing change? Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler ### How was this patch tested? pytest test_sampler.py Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Co-authored-by: ZhengWG <zwg0606@gmail.com>	2025-06-05 16:42:18 +08:00
wangxiyuan	e1ab6d318e	[Misc] Refactor additional_config (#1029 ) More and more config options are being added to additional_config. This PR provides a new AscendConfig that manages these options more simply and makes the code cleaner and more readable. This PR also adds the `additional_config` doc for users and adds test_ascend_config.py to make sure the new AscendConfig works as expected. TODO: Add e2e test with torchair and deepseek once the CI resource is available. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-05 16:28:01 +08:00
zhangxinyuehfad	7737aaa40f	[CI] Add accuracy test for Qwen2.5-VL-3B-Instruct (#766 ) ### What this PR does / why we need it? Add accuracy test for Qwen2.5-VL-3B-Instruct Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-06-05 15:09:20 +08:00
Li Wang	b4cb0eecb6	[CI] Hotfix on benchmark results path (#1076 ) ### What this PR does / why we need it? Fix benchmark results path ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-05 12:53:46 +08:00
Yikun Jiang	fd136e6762	Add vLLM Ascend project governance docs (#1070 ) ### What this PR does / why we need it? Add vLLM Ascend project governance and first contributors docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Closes: https://github.com/vllm-project/vllm-ascend/issues/828 Closes: https://github.com/vllm-project/vllm-ascend/issues/929 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-05 11:56:51 +08:00
Li Wang	31dd471574	[CI] Add workflow_dispatch and use main benchmarks directly (#1071 ) ### What this PR does / why we need it? The benchmark iteration checks out each commit in turn, which changes the benchmark scripts along the way, so we need to ensure the benchmark scripts are always available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-05 10:29:30 +08:00
Yikun Jiang	9e855b70be	Adjust concurrency group for each npu workflow (#1068 ) ### What this PR does / why we need it? Adjust concurrency group for each npu workflow - pd and benchmarks share static-08-01, so only one of these jobs can run at a time - for other jobs, each PR/schedule should have only 1 job running ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-05 09:17:04 +08:00
Mengqing Cao	afc4c0cd03	[Bugfix] Fix deepseek precision issue and add acc ci for it (#905 ) ### What this PR does / why we need it? Fix deepseek precision issue on V0 and add acc ci for it Fixes https://github.com/vllm-project/vllm-ascend/issues/1062 ### How was this patch tested? CI passed with newly added test. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-04 20:26:44 +08:00
NeverRaR	da9acfca60	feat: support data parallel for deepseek (#1012 ) ### What this PR does / why we need it? feat: support data parallel for deepseek ### Does this PR introduce _any_ user-facing change? Yes, support dp for deepseek ### How was this patch tested? ``` export VLLM_ENABLE_MC2=0 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh nohup python -m vllm.entrypoints.openai.api_server --model=/path/to/DeepSeek-R1-W8A8 \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=8 \ -dp=2 \ --max-num-seqs 24 \ --max-model-len 4096 \ --max-num-batched-tokens 4096 \ --block-size 128 \ -O 0 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}' \ --gpu-memory-utilization 0.95 &> run.log & disown ``` Signed-off-by: boying <897013703@qq.com>	2025-06-04 18:31:41 +08:00
Li Wang	517811449e	[CI] Re-enable sleep mode test and skip failure breaking CI (#990 ) ### What this PR does / why we need it? - Re-enable sleep mode test - Fix nightly performance benchmark workflow - Fix model-runner-v1 bug for upstream [change](https://github.com/vllm-project/vllm/pull/18654) --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-04 16:24:16 +08:00
Li Wang	eb2701e0b2	[CI] Remove workflow_dispatch and change schedule time (#1056 ) ### What this PR does / why we need it? - Remove workflow_dispatch - Change schedule time to 2:00 UTC+8 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed --------- Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-04 01:19:20 +08:00
Li Wang	06fb5a8d81	[CI][Bugfix] Upgrade escli to v0.2.1 to fix benchmark deps (#1055 ) ### What this PR does / why we need it? Update escli-tool to v0.2.1 to fix deps bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: wangli <858794774@qq.com>	2025-06-04 01:03:56 +08:00
Li Wang	76dacf3fa0	[CI][Benchmark] Optimize performance benchmark workflow (#1039 ) ### What this PR does / why we need it? This is a post patch of #1014, for some convenience optimization - Set cached dataset path for speed - Use pypi to install escli-tool - Add benchmark results convert script to have a developer-friendly result - Patch the `benchmark_dataset.py` to disable streaming load for internet - Add more trigger ways for different purpose, `pr` for debug, `schedule` for daily test, `dispatch` and `pr-labled` for manual testing of a single(current) commit - Disable latency test for `qwen-2.5-vl`, (This script does not support multi-modal yet) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-03 23:38:34 +08:00
wangxiyuan	543380ceae	[CI] Add merge conflict label job (#1050 ) Add a bot to label merge conflicts, which helps developers and maintainers with code review and makes required updates clear. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-03 17:32:31 +08:00
Yikun Jiang	f24375f318	Enable accuracy test for PR labeled with "accuracy-test" (#1040 ) ### What this PR does / why we need it? This PR enables the accuracy test for PRs labeled with "accuracy-test" and for workflow_dispatch. Only one model test runs for each test type to reduce execution time. - The dense test costs about `25mins` to complete (gsm8k 7mins, ~mmlu 3h24mins,~ cEval 18mins) - The vl test costs about `40mins` to complete In the future, we might consider enabling all test jobs as nightly scheduled jobs. The main changes are: - the dense/vl accuracy test will be triggered by labeling `accuracy-test` and `ready-for-test` - the dense accuracy test will be triggered by labeling `dense-accuracy-test` and `ready-for-test` - the vl accuracy test will be triggered by labeling `vl-accuracy-test` and `ready-for-test` - accuracy test will also be triggered by workflow_dispatch - Support V1 and V0 for qwen and V0 for VL For PR tests we also generate a summary in the test summary. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - CI passed with accuracy-test label - Preview: https://github.com/vllm-project/vllm-ascend/actions/runs/15407628722?pr=1040 Closes: https://github.com/vllm-project/vllm-ascend/pull/953 --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-06-03 15:38:13 +08:00
Shanshan Shen	068c3a0167	[Bugfix] Add verification for `quant_action.choices` to avoid `TypeError` (#1046 ) ### What this PR does / why we need it? When I run vllm-ascend, I get this error msg: ```bash Traceback (most recent call last): File "/home/sss/software/miniconda3/envs/vllm-v1/bin/vllm", line 8, in <module> sys.exit(main()) File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/main.py", line 50, in main cmd.subparser_init(subparsers).set_defaults( File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/serve.py", line 101, in subparser_init serve_parser = make_arg_parser(serve_parser) File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 254, in make_arg_parser parser = AsyncEngineArgs.add_cli_args(parser) File "/home/sss/github/vllm-project/vllm/vllm/engine/arg_utils.py", line 1582, in add_cli_args current_platform.pre_register_and_update(parser) File "/home/sss/github/vllm-project/vllm-ascend/vllm_ascend/platform.py", line 80, in pre_register_and_update if ASCEND_QUATIZATION_METHOD not in quant_action.choices: TypeError: argument of type 'NoneType' is not iterable [ERROR] 2025-06-03-02:53:42 (PID:6005, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception ``` This is because the `choices` attribute in `quant_action` can be `None` and we don't check it. ```bash # quant_action _StoreAction(option_strings=['--quantization', '-q'], dest='quantization', nargs=None, const=None, default=None, type=<class 'str'>, choices=None, required=False, help='Method used to quantize the weights. If `None`, we first check the\n`quantization_config` attribute in the model config file. If that is\n`None`, we assume the model weights are not quantized and use `dtype` to\ndetermine the data type of the weights.', metavar=None) ``` Thus, I have added a check for `choices` to handle the scenario of `choices=None`. ### Does this PR introduce _any_ user-facing change? yes, vllm server with ascend quantization works now. ### How was this patch tested? by `vllm server --quantization ascend` command. Related: https://github.com/vllm-project/vllm/issues/19004 Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-03 11:44:45 +08:00
Shanshan Shen	93860574bb	[ModelRunner][MultiModal] Remove legacy input mapper/processor from V0 (#951 ) ### What this PR does / why we need it? Remove legacy input mapper/processor from V0. Find more details at https://github.com/vllm-project/vllm-ascend/issues/673 and https://github.com/vllm-project/vllm/pull/15686. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Launch online service: ```bash vllm serve Qwen/Qwen2.5-VL-7B-Instruct \ --dtype bfloat16 \ --max_model_len 32768 \ --max-num-batched-tokens 32768 ``` Query the server: ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}}, {"type": "text", "text": "What is the text in the illustrate?"} ]} ] }' ``` Result: ```bash {"id":"chatcmpl-619e70733ed148b3be3a0b6524ee0ef3","object":"chat.completion","created":1748226332,"model":"/home/sss/.cache/modelscope/hub/models/Qwen/Qwen2___5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"pro ``` Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-03 11:32:03 +08:00
NINGBENZHE	6ec64a3f96	[bugfix] some bugs maybe fail to run (#896 ) ### What this PR does / why we need it? Solve the bug that the graph mode is the same as p and d, and some other bugs. ### Does this PR introduce _any_ user-facing change? Wouldn't be ### How was this patch tested? Follow the end-to-end test Signed-off-by: ningbenzhe1 <ningbenzhe@huawei.com>	2025-06-03 11:07:33 +08:00
Yikun Jiang	92bc5576d8	Skip benchmarks/ in vllm ascend test (#1041 ) ### What this PR does / why we need it? Skip benchmarks/ in vllm ascend test to reduce CI cost ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-01 19:01:26 +08:00
NeverRaR	507ae627ca	feat: support compile torchair graph while warming up (#839 ) ### What this PR does / why we need it? feat: support compile torchair graph while warming up Signed-off-by: boying <897013703@qq.com>	2025-05-31 06:03:03 +08:00
Li Wang	d9fb027068	[CI] Add benchmark workflows (#1014 ) ### What this PR does / why we need it? Add benchmark workflows ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run locally --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-05-30 22:42:44 +08:00
yiz-liu	5a1689fc64	[Fix] Fix update_aclgraph_sizes when running MoE models (#913 ) ### What this PR does / why we need it? Fix update_aclgraph_sizes when running MoE models. --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-30 15:17:11 +08:00
XWFAlone	3442fbdb23	[1/N][UT][v1 MTP] add basic v1 mtp features (#890 ) ### What this PR does / why we need it? add basic v1 mtp features please merge it after https://github.com/vllm-project/vllm-ascend/pull/874 and https://github.com/vllm-project/vllm-ascend/pull/844. ### Does this PR introduce _any_ user-facing change? Basic v1 MTP is now supported, limited to TP only, eager mode, and k=1; we will continue to add more scenarios. ### How was this patch tested? local tested Signed-off-by: XWFAlone <xuewenfei2@huawei.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: JC-ut0 <xuyexiong@huawei.com>	2025-05-30 08:59:58 +08:00
wangxiyuan	5903547d09	[doc] add 0.7.3.post1 release note (#1008 ) Add release note for 0.7.3.post1 Add the missing release note back for 0.7.3 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-29 17:38:34 +08:00
22dimensions	c464c32b81	add doc for offline quantization inference (#1009 ) add example for offline inference with quantized model Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-05-29 17:32:42 +08:00
zouyida2052	05a471001b	bugfix for qwen2_5_vl (#805 ) ### What this PR does / why we need it? the interface of qwen2.5vl changes from column linear to qkv linear, this makes our weight pad func become abnormal, thus we optimize split_qkv func to fix this bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? with CI Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-05-29 17:20:39 +08:00
Mengqing Cao	a93bed4535	[aclgraph] implement NPUPiecewiseBackend to enable aclgraph (#836 ) ### What this PR does / why we need it? 1. Implement `NPUPiecewiseBackend` to enable aclgraph 2. Enable aclgraph by default in V1, but raise an error when running deepseek and raise a warning when running models other than qwen ### How was this patch tested? CI passed with the new ut --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-29 11:58:26 +08:00
Mengqing Cao	cc74b97f74	[Bugfix][V1] Fix deepseek with v1 (#958 ) ### What this PR does / why we need it? Fix deepseek with v1. This error was introduced by https://github.com/vllm-project/vllm-ascend/pull/945, and this PR fixes the block table of mla ### How was this patch tested? CI passed with newly added test. Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-05-29 11:57:43 +08:00
ApsarasX	e3c7f71462	[Perf] Refactor tensor disposal logic to reduce memory usage (#966 ) ### What this PR does / why we need it? 1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580 https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory by promptly deleting unnecessary tensors. For tensors passed from upper-layer functions, I used a list container to transfer the parameter and then popped the tensor from the list within the inner function to achieve deletion. Recently, I discovered a better implementation in sglang, the `dispose_tensor` function, and I recommend adopting this approach. 2. Dispose `hidden_states` and `residual` from the previous layer once they're no longer used. 3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios. With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model under the conditions of `TP=16` and `max-model-len=32768`, we can save 1.3GB of npu memory. Reference: https://github.com/sgl-project/sglang/pull/6147 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-05-29 11:48:26 +08:00
Mengqing Cao	6eddbd2521	[CI/UT][PD Disaggregate] Initialize PD Disaggregate UT (#889 ) Initialize PD Disaggregate UT --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-29 10:17:12 +08:00
wangxiyuan	f6e5decc10	[CI] upgrade to vllm 0.9.0 (#959 ) Upgrade to vllm 0.9.0. 0.8.5 will not be supported any more. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 21:18:41 +08:00
wangxiyuan	e2a0c19cea	[CI] Refactor CI (#952 ) 1. remove some useless test func and file 2. fix format.sh problem 3. enable full test for singlecard and multicard 4. move long term test to long_term folder. This kind of test runs only when labeled and in the daily test. It includes: spec decode, accuracy test ## After refactor: There are 4 test modules - `singlecard`: contains the tests running on one NPU. It'll be run for each PR and daily test. - `multicard`: contains the tests running on multiple NPUs. It'll be run for each PR and daily test. - `long_term`: contains the tests that take a long time (now includes the `spec decode` and `accuracy` tests). It'll be run for PRs labeled `long-term-test` and for the daily test. - `e2e`: contains the tests for doc and pd feature. It'll be run for PRs labeled `pd-test` and for the daily test. ## Todo: 1. some tests are skipped; they should be fixed and re-enabled in the future. 2. pyhccl test for multicard doesn't work at all. It should be enabled as well. 3. ensure long-term-test passes in the daily test. ### Known issue Now, the `ready` label is required to start the pd test or long term test. And when `long-term-test` or `pd-test` is labeled after the other one, the previously labeled test will be re-run. So the labeled tests should be run in the following steps: 1. decide which tests need to run, then label them: `long-term-test` or `pd-test` or both. 2. add the `ready-for-test` label, then the tests will be run. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-28 06:31:35 +08:00
Angazenn	9f5ab59e30	[WIP][BugFix]Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 (#961 ) ### What this PR does / why we need it? This PR fixes accuracy issues introduced by code that adapts to `FusedMoEParallelConfig` in vLLM 0.9.0. The `tp_size` used to split weights is passed incorrectly. The root cause is that the vLLM community and vLLM-Ascend use different methods to decide whether to use Expert Parallel. vLLM: vLLM uses a flag `enable_expert_parallel` to indicate whether to use EP and uses the following code to decide `ep_size`: ``` use_ep = (dp_size_ * tp_size_ > 1 and vllm_parallel_config.enable_expert_parallel) dp_size = dp_size_ dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0 tp_size, tp_rank = flatten_tp_across_dp(dp_rank) if not use_ep: return FusedMoEParallelConfig(tp_size=tp_size, tp_rank=tp_rank, dp_size=dp_size, dp_rank=dp_rank, ep_size=1, ep_rank=0, use_ep=False) # DP + EP / TP + EP / DP + TP + EP assert use_ep # In EP, each device owns a set of experts fully. There is no tensor # parallel update tp_size, tp_rank, ep_size and ep_rank to reflect that. ep_size = tp_size ep_rank = tp_rank return FusedMoEParallelConfig(tp_size=1, tp_rank=0, dp_size=dp_size, dp_rank=dp_rank, ep_size=ep_size, ep_rank=ep_rank, use_ep=True) ``` vLLM-Ascend: vLLM-Ascend uses `etp` to specify Tensor Parallel in MoE. ``` self.ep_size = get_ep_group().world_size self.tp_size = get_etp_group().world_size self.dp_size = (dp_size if dp_size is not None else get_dp_group().world_size) ``` So there will be conflicts if we simply combine these pieces of code. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-27 15:16:17 +08:00
Shuqiao Li	01e3d59eae	add workflow to build and release wheel (#775 ) ### What this PR does / why we need it? This is a continuing work of #716. This PR add workflow to build and release wheel, and also release source to PYPI. We have 3 conditions to trigger the workflow: 1. PR to `main` and `-dev` 2. push to `main` and `-dev` 3. push tag with name of `v*` Release to PYPI will only be done under condition 3. Under condition 1 and 2, it will generate .tar.gz and build .whl, upload to github artifacts but will not release. update: Will build .whl and upload to github artifacts with scheduled task. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All triggered conditions are well tested with my fork repo. --------- Signed-off-by: Shuqiao Li <celestialli@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-05-26 14:18:26 +08:00
Mengqing Cao	a0c3e9ba50	[Bugfix] Adjust inputbatch to be compatible with latest vllm (#945 ) Adjust inputbatch to be compatible with latest vllm, as kvcache group feature has been redo in https://github.com/vllm-project/vllm/pull/18593 --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-05-26 10:33:28 +08:00
Angazenn	1f9fb869ad	[BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897 ) ### What this PR does / why we need it? This PR fixes two accuracy bugs incurred by PR #819 when running deepseekv3 series models: 1. #819 adds `all_to_all` communication in quantized cases, but `all_gather` && `reduce_scatter` are removed in both of quantized and unquantized cases. When running unquantized deepseekv3 models with `ep_size == world_size`, the moe modules fail to communicate. Therefore, this PR adds `all_to_all` communication on unquantized situation to solve this accuracy issue. 2. Use `ep_size` rather than `dp_size` to decide whether to use `all_to_all` in moe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-24 14:29:36 +08:00
yiz-liu	17f05b1089	[Feature] Add CustomQwen3MoeForCausalLM model (#925 ) Tweak packed_modules_mapping to support W8A8 weights. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-23 15:50:48 +08:00
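
The `choices=None` failure described in commit 068c3a0167 can be reproduced and guarded against with the standard library alone. The sketch below is illustrative, not the project's actual patch: the function name `register_quantization_choice` and the constant `ASCEND_QUANTIZATION_METHOD` are hypothetical, and it assumes that `choices`, when set, is a mutable list.

```python
import argparse

# Hypothetical constant for illustration; the project defines its own
# identifier in vllm_ascend/platform.py.
ASCEND_QUANTIZATION_METHOD = "ascend"


def register_quantization_choice(parser: argparse.ArgumentParser) -> bool:
    """Append the Ascend method to --quantization choices if a list exists.

    Returns True when the choices list was extended, False otherwise.
    """
    # _option_string_actions is a private argparse mapping from an option
    # string to its Action object.
    quant_action = parser._option_string_actions.get("--quantization")
    if quant_action is None or quant_action.choices is None:
        # choices=None means any value is accepted, so there is nothing to
        # extend. Skipping this check is what raises the TypeError.
        return False
    if ASCEND_QUANTIZATION_METHOD not in quant_action.choices:
        quant_action.choices.append(ASCEND_QUANTIZATION_METHOD)
        return True
    return False


# Parser whose --quantization option has no choices (the failing case).
open_parser = argparse.ArgumentParser()
open_parser.add_argument("--quantization", "-q", type=str, default=None)

# Parser whose --quantization option is restricted to a list of values.
strict_parser = argparse.ArgumentParser()
strict_parser.add_argument("--quantization", "-q", type=str,
                           choices=["awq", "gptq"])

open_extended = register_quantization_choice(open_parser)
strict_extended = register_quantization_choice(strict_parser)
parsed = strict_parser.parse_args(["--quantization", "ascend"])
```

With the guard in place, both parser shapes accept `--quantization ascend`: the unrestricted parser is left untouched, and the restricted one gains the extra choice.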


330 Commits
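
The expert-parallel decision quoted in commit 9f5ab59e30 can be traced with plain integers. This is a simplified sketch of that excerpt, not vLLM's implementation: `MoEParallelSketch` and `make_moe_parallel` are hypothetical names, ranks are passed in as arguments instead of being read from process groups, and the flattening rule (`tp_size = dp_size * tp_size`, `tp_rank = dp_rank * tp_size + tp_rank`) is an assumption about what `flatten_tp_across_dp` computes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEParallelSketch:
    """Hypothetical stand-in for the fields of FusedMoEParallelConfig."""
    tp_size: int
    tp_rank: int
    dp_size: int
    dp_rank: int
    ep_size: int
    ep_rank: int
    use_ep: bool


def make_moe_parallel(tp_size_: int, tp_rank_: int, dp_size_: int,
                      dp_rank: int,
                      enable_expert_parallel: bool) -> MoEParallelSketch:
    use_ep = dp_size_ * tp_size_ > 1 and enable_expert_parallel
    # Assumed flattening rule: treat all dp * tp devices as one TP group.
    tp_size = dp_size_ * tp_size_
    tp_rank = dp_rank * tp_size_ + tp_rank_
    if not use_ep:
        # Without EP, weights are split across the flattened TP group.
        return MoEParallelSketch(tp_size, tp_rank, dp_size_, dp_rank,
                                 ep_size=1, ep_rank=0, use_ep=False)
    # With EP, each device owns whole experts, so TP collapses to size 1
    # and the flattened group becomes the EP group.
    return MoEParallelSketch(1, 0, dp_size_, dp_rank,
                             ep_size=tp_size, ep_rank=tp_rank, use_ep=True)


no_ep = make_moe_parallel(tp_size_=8, tp_rank_=3, dp_size_=2, dp_rank=1,
                          enable_expert_parallel=False)
with_ep = make_moe_parallel(tp_size_=8, tp_rank_=3, dp_size_=2, dp_rank=1,
                            enable_expert_parallel=True)
```

Under these assumptions the MoE `tp_size` is either the flattened group size or 1, never an independently chosen value. That may be why combining this logic with vLLM-Ascend's separate `etp` group, as the commit message describes, passed the wrong `tp_size` for weight splitting.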