b411418ff0
This vllm commit changes the sample usage. This PR adapts the change for
main and makes sure it works for 0.8.4 as well.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and
torch_npu releases.
### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set
enforce_eager=True.
### How was this patch tested?
Tested with regular offline inference examples.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Fix the qwen2.5-vl position input bug. Fixes #625: `TypeError: 'NoneType' object
is not iterable`
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR supports access of vllm-ascend to the piecewise_graph feature
provided by the v1 engine.
1. Register unifiled_ascend_attention_with_output for piecewise_graph to
split the graph.
2. Support NPUGraph to accelerate kernel launch.
### Does this PR introduce _any_ user-facing change?
NPUGraph is now enabled by default. Users can disable the NPUGraph feature by
configuring `enforce_eager`.
This places corresponding requirements on the torch_npu and CANN versions:
they need to support graph capture.
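As an illustration (the model name is only an example), opting out of NPUGraph looks like:

```python
from vllm import LLM

# NPUGraph is on by default; pass enforce_eager=True to opt out when the
# installed CANN/torch_npu do not yet support graph capture.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)
```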
### How was this patch tested?
It is enabled by default, so the existing tests cover it.
---------
Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Support deepseek with w8a8 quantization;
2. Support deepseek with mixed parallelism (multi-DP, EP+TP);
3. Support deepseek with graph mode.
---------
Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
### What this PR does / why we need it?
Add a DP stateless process group initialization path with the HCCL backend as
a vllm-ascend patch.
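A rough illustration of the idea (not the actual patch; the real code goes through vLLM's stateless process-group helpers, and the host/port handling below is simplified):

```python
import torch.distributed as dist

def init_dp_group_hccl(host: str, port: int, rank: int, world_size: int):
    # Stand up a dedicated store instead of relying on a previously
    # initialized default process group.
    store = dist.TCPStore(host, port, world_size, is_master=(rank == 0))
    dist.init_process_group(
        backend="hccl",  # HCCL backend registered by torch_npu
        store=store,
        rank=rank,
        world_size=world_size,
    )
    return dist.group.WORLD
```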
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Format the patch module to make it clearer.
Add the patch doc description; new patches must follow this guide.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add notes for deepseek's patch and remove some unnecessary comments.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
This PR fixes some bugs about spec decode / MTP.
It also adds an MTP e2e UT `test_mtp_correctness.py`.
**vllm_ascend/attention/attention.py**
1. support `self.attn_mask_cache` having only 1 element, to cover the scenario
in which both spec decode and chunked prefill are enabled.
**vllm_ascend/distributed/parallel_state.py**
1. remove 2 asserts, because the spec decode worker calls init_worker
twice.
**vllm_ascend/models/deepseek_mtp.py**
1. remove unused params;
2. support w8a8 in `CustomDeepSeekMTP`.
**vllm_ascend/quantization/quant_config.py**
1. use `AscendUnquantizedFusedMoEMethod` instead of
`UnquantizedFusedMoEMethod`.
**other**
1. replace `from vllm.logger import init_logger` with `from vllm.logger
import logger` throughout the vllm-ascend project (see the sketch below).
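As a sketch, the logger change in each module looks like this (exact call sites vary):

```python
# Before: each module constructed its own logger.
# from vllm.logger import init_logger
# logger = init_logger(__name__)

# After: the shared vllm logger is imported directly.
from vllm.logger import logger

logger.info("using the shared vllm logger")
```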
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: mengwei805 <mengwei25@huawei.com>
### What this PR does / why we need it?
Add docs for patch_config.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No code changed.
Signed-off-by: Shuqiao Li <celestialli@outlook.com>
### What this PR does / why we need it?
Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic
quantization on linear and moe layers (W8A8_DYNAMIC). The quantizer is enabled if a model
has a [quantize
field](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27).
If MindIE Turbo is installed, the MindIE Turbo Quantizer is applied;
otherwise VLLMAscendQuantizer is used directly (see the sketch after the list below).
- This patch fixes the installation docs to make installation work
- This patch enables norm quantization by patching `RMSNorm.__init__`,
`RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model`
- Add `AscendW8A8LinearMethod` for W8A8
- Add `AscendW8A8DynamicLinearMethod` and
`AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC
- Add an e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
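A minimal sketch of the selection logic described above (class names and import paths are illustrative, not the actual vllm-ascend or MindIE Turbo APIs):

```python
def select_quantizer(quant_description: dict):
    try:
        # Prefer the MindIE Turbo quantizer when the package is installed.
        from mindie_turbo import MindIETurboQuantizer  # hypothetical import path
        return MindIETurboQuantizer(quant_description)
    except ImportError:
        # Fall back to the built-in Ascend quantizer otherwise.
        from vllm_ascend.quantization.quantizer import VLLMAscendQuantizer  # hypothetical path
        return VLLMAscendQuantizer(quant_description)
```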
### Does this PR introduce _any_ user-facing change?
Yes, w8a8 quantization is supported. After this patch, users can
use the command below to run w8a8 models:
```bash
vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B"
```
### How was this patch tested?
0. CI passed: added an e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
1. From @Yikun:
I tested Qwen2.5-0.5B-Instruct-w8a8 functionally and all is well; please
refer to
https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613
2. From @dingdingchaomian:
Tested with the qwen2.5-72b-instruct and deepseek-v2-lite-chat models; both
models were quantized using Ascend's msmodelslim tool:
- Qwen2.5-72b-instruct was tested twice, once for w8a8 static and once
for w8a8 dynamic.
- Deepseek-v2-lite-chat was tested once because its quantization uses
both static and dynamic w8a8.
Models were tested using both offline inference and online serving, and
both work well. The inference code is exactly the same as the
examples in
https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with the
model path and tensor parallel number changed.
---------
Signed-off-by: dingdingchaomian <wangce21@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: dingdingchaomian <wangce21@huawei.com>
Co-authored-by: Angazenn <zengyanjia@huawei.com>
Co-authored-by: liujiaxu <liujiaxu4@huawei.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Found a problem when an ImportError is raised that is not a ModuleNotFoundError.
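For context, `ModuleNotFoundError` is a subclass of `ImportError`, so code that only catches `ModuleNotFoundError` misses import failures inside an installed package (for example, a missing shared library). A minimal sketch of the safer pattern, with a hypothetical optional dependency:

```python
try:
    import mindie_turbo  # hypothetical optional dependency
except ImportError:      # also covers ModuleNotFoundError
    mindie_turbo = None
```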
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Shuqiao Li <celestialli@outlook.com>
### What this PR does / why we need it?
Add the `apply_grammar_bitmask()` method to the model runner.
This method is necessary for `xgrammar` structured output.
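A simplified, single-request sketch of what applying a grammar bitmask does (the real method follows vLLM's V1 structured-output flow and operates on batched logits):

```python
import torch

def apply_grammar_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> torch.Tensor:
    # bitmask packs 32 vocab entries per int32 (as xgrammar produces);
    # tokens the grammar disallows get -inf so the sampler never picks them.
    vocab_size = logits.shape[-1]
    bits = torch.arange(32, dtype=torch.int32, device=bitmask.device)
    allowed = ((bitmask.unsqueeze(-1) >> bits) & 1).bool().reshape(-1)[:vocab_size]
    return logits.masked_fill(~allowed, float("-inf"))
```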
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
This PR adds the sleep mode feature for vllm-ascend. When it sleeps, we do
mainly two things:
- offload model weights
- discard kv cache
RLHF tools (such as https://github.com/volcengine/verl and
https://github.com/OpenRLHF/OpenRLHF) have a strong need for sleep mode
to accelerate the training process.
This PR may solve #375 and #320.
### Does this PR introduce _any_ user-facing change?
No existing user interfaces are changed.
Users get two new methods (`sleep()` and `wake_up()`) to use.
### How was this patch tested?
This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.
At first, we have free NPU memory M1.
After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)`
is executed, we have free NPU memory M2, with M2 < M1.
Then we call `llm.sleep(level=1)` and have free NPU memory M3.
We have M3 > M2, and M3 is very close to M1.
Plus, we get the same output tokens before sleep and after wake up,
with the config `SamplingParams(temperature=0, max_tokens=10)` and
with the same input tokens, of course.
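A usage sketch of the test above (the prompt is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
params = SamplingParams(temperature=0, max_tokens=10)

before = llm.generate(["The future of AI is"], params)

llm.sleep(level=1)   # offload model weights and discard the kv cache
llm.wake_up()        # reload weights before serving again

after = llm.generate(["The future of AI is"], params)
assert before[0].outputs[0].text == after[0].outputs[0].text
```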
This PR is utilizing the CMake procedure of #371 , thanks a lot.
Signed-off-by: Shuqiao Li <celestialli@outlook.com>
1. Add a `vllm_version_is` function to check the vllm version.
2. `ensure_kv_transfer_initialized` and `get_kv_transfer_group` have
been moved to another place on the vllm main branch via
3408e47159
; this patch fixes the resulting import error.
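A minimal sketch of what a version-gate helper like `vllm_version_is` can look like (the actual helper in vllm-ascend may differ in detail):

```python
import vllm

def vllm_version_is(target: str) -> bool:
    # Compare the installed vllm version against a target like "0.8.4".
    return vllm.__version__ == target
```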
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Adopt the custom rotary embedding kernel in actual model inference. The
customized rotary_embedding generates contiguous query and key on
the cpp side, reducing the overhead of two contiguous calls and an index_select
compared with the rotary_embedding in torch_npu. For now, rotary_embedding
only supports the `is_neox = true` scenario; the non-neox version of rope
will be added in the future. A reference sketch of the neox-style rotation follows.
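For reference, a minimal eager-mode sketch of the neox-style rotation the kernel computes (not the custom kernel itself, which additionally avoids the extra contiguous/index_select copies):

```python
import torch

def neox_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: [..., head_dim]; cos/sin broadcast over the same shape.
    # Neox-style rope rotates the two halves of the head dimension.
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin
```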
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
This PR fixes scheduler problems from the last PR:
1. Change the position of the DT test to validate it.
2. Fix the format of the copyright header.
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This supports speculative decoding on Ascend, including speculating with
a draft model, matching n-grams in the prompt, using MLP speculators,
and using EAGLE-based draft models.
Backport: https://github.com/vllm-project/vllm-ascend/pull/423
Spec decode `MultiStepWorker` now fully supports `TP1DraftModelRunner`:
the draft_model_runner can run multi-step prepare on the NPU directly
and can use MLA.
1. Before this PR, `MultiStepWorker` would not step into the branch
using NPU prepare, only into the branch using CPU prepare (line 52
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has
no effect on the correct operation of speculative decoding, and the
performance of the two branches is basically the same in the current
version, this PR enables entering that branch. In general, there
are two main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when the outer MultiStepWorker can work correctly.
2. Before this PR, `TP1DraftModelRunner` only supported Attention on NPU,
not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don't know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I made corresponding changes. Otherwise, when attn_backend is
MLA, it reports that input_positions cannot be found.
3. I commented out two lines around line 118 of `draft_model_runner.py` to
support the K>1 scenario.
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments there. In the future, when vllm-ascend supports the LoRA feature,
these changes can be reverted.
TODO:
- [ ] revert the patch when the related issues are addressed in vllm
### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py
Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
This PR adds AscendScheduler to the vLLM v1 engine.
This scheduler currently supports a v0-style prefill-first scheduling
strategy; a toy sketch of the policy is shown below.
In the future more scheduling methods will be supported by this scheduler.
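A toy illustration of the prefill-first policy (not the actual AscendScheduler code; request objects and budgeting are simplified):

```python
def schedule_step(waiting, running, token_budget=8192):
    # Admit waiting prefill requests up to the token budget first; only when
    # nothing is waiting do we schedule decode steps for running requests.
    scheduled, used = [], 0
    while waiting and used + waiting[0].num_prompt_tokens <= token_budget:
        req = waiting.pop(0)
        used += req.num_prompt_tokens
        scheduled.append(req)
    if not scheduled:
        scheduled = list(running)
    return scheduled
```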
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Fix the API in DeepSeekV2, aligning with the latest code on the vllm
main branch.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally with deepseek-v2-lite; CI will be added by @Potabk.
Please update the model UT after this PR is merged, thanks! cc @Potabk
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove `supports_structured_output()` from the platform. This method is no longer needed because upstream has deleted it.
Signed-off-by: shen-shanshan <467638484@qq.com>
This PR adds a patch module for vllm:
1. platform patch: the patch is registered when the platform is loaded.
2. worker patch: the patch is registered when the worker is started.
The details are:
1. patch_common: patches for both the main and 0.8.4 versions
2. patch_main: patches for the main version
3. patch_0_8_4: patches for the 0.8.4 version
### What this PR does / why we need it?
Adapt the Disaggregated Prefill feature to Ascend devices
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
The test usage is provided along with the PR, in
examples/offline_disaggregated_prefill_npu.py.
To run it:
```bash
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```
---------
Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
### What this PR does / why we need it?
This PR enables the custom ops build by default.
### Does this PR introduce _any_ user-facing change?
Yes, installing vllm-ascend from source will now trigger the custom ops
build step.
### How was this patch tested?
By image build and e2e CI
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.
Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.
Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements; a minimal
sketch of the idea follows.
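A minimal sketch of the approach (the environment-variable name and sizes below are illustrative, not the actual vllm-ascend settings):

```python
import os
import torch

# Pre-build one large lower-triangular mask once and slice from it at runtime.
_MAX_MASK_LEN = int(os.getenv("ATTN_MASK_LEN", "2048"))  # hypothetical variable
_MASK_CACHE = torch.tril(torch.ones(_MAX_MASK_LEN, _MAX_MASK_LEN, dtype=torch.bool))

def get_attn_mask(seq_len: int) -> torch.Tensor:
    if seq_len <= _MAX_MASK_LEN:
        return _MASK_CACHE[:seq_len, :seq_len]  # cheap slice, no new allocation
    # Fallback: build dynamically when the request exceeds the cached size.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```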
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
Fix CI by updating mypy and pinning the numpy version.
_The modification of model_runner_v1 is just to make CI happy._
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
Signed-off-by: MengqingCao <cmq0113@163.com>
1. Remove useless code in attention.py.
2. Multistep now uses StatefulModelInputForNPU instead of
StatefulModelInput.
Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
### What this PR does / why we need it?
We propose the FastPatch method, which optimizes the patch embedding
(Conv3D) for Qwen2VL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We've tested it on benchmarks; it meets our expectations and performs better
than the original patch_embed layer.
---------
Signed-off-by: baifanxxx <baifanxxx@gmail.com>
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
1. Doc: fix a broken link
2. Doc: make the Chinese version match the English one
3. Remove the useless file `test.py`
4. Update `collect_env.py`
5. Fix a v1 import error
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add support for V1 Engine.
Please note that this is just the initial version; there may be some
places that need to be fixed or optimized in the future. Feel free to leave
comments for us.
### Does this PR introduce _any_ user-facing change?
To use V1 Engine on NPU device, you need to set the env variable shown
below:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
If you are using vllm for offline inference, you must add a `__main__`
guard like:
```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```
Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
Query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
It fixes the following bugs:
1. When searching for a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear layers is required to
identify the corresponding quant type.
2. The exception is narrowed down to ImportError when importing
MindIETurboQuantizer, so that other errors are raised properly.
3. The API of AscendKVCacheMethod.apply is aligned with that of
AscendAttentionBackendImpl.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By performing offline inference.
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
Support encoder-only attention with torch SDPA (a minimal sketch is shown below).
Fixes
https://github.com/vllm-project/vllm-ascend/pull/229#issuecomment-2695942741
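A minimal sketch of encoder-only (bidirectional) attention with torch SDPA; shapes are illustrative, and in practice only a padding mask would be applied:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)  # [batch, heads, seq_len, head_dim]
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Encoder-only attention is bidirectional, so no causal mask is used.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
```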
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally with `pytest
vllm-project/vllm/tests/entrypoints/openai/test_score.py`.
**Note**: Since torch compile on NPU is still a work in progress, we need
to comment out the following code to make the UT run:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L138
result:
```bash
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/xxx/code/vllm-cpu/vllm
configfile: pyproject.toml
plugins: shard-0.1.2, rerunfailures-15.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 8 items
Running 8 items in this shard
tests/entrypoints/openai/test_score.py ........ [100%]
==================================================================================== warnings summary ====================================================================================
../../../miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================== 8 passed, 1 warning in 131.42s (0:02:11) ========================================================================
```
This UT will be included in CI when the torch compile feature is done.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR fixes the error when running inference with Qwen2_VL.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
We've tested it on benchmarks; it meets our expectations and matches the
GPU results.
---------
Signed-off-by: zouyida <zouyida@huawei.com>