This PR adds AscendScheduler to the vLLM v1 engine.
The scheduler currently supports a v0-style, prefill-first scheduling
strategy.
More scheduling policies will be supported by this scheduler in the future.
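For reference, below is a minimal sketch of what a prefill-first policy looks like. The queue layout and request fields are illustrative assumptions, not the actual AscendScheduler code:
```python
from collections import deque

def schedule_step(waiting: deque, running: list, token_budget: int) -> list:
    """Prefill-first: admit waiting (prefill) requests before running decodes."""
    scheduled = []
    # Drain the waiting queue while the token budget allows.
    while waiting and waiting[0].num_prompt_tokens <= token_budget:
        req = waiting.popleft()
        token_budget -= req.num_prompt_tokens
        running.append(req)
        scheduled.append(req)
    # v0-style behaviour: a step is either a prefill batch or a decode batch,
    # so decodes only run when no prefill was admitted this step.
    if not scheduled:
        scheduled = list(running)
    return scheduled
```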
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Fix the API in DeepSeekV2 to align with the latest code on the main branch
of vLLM.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally with deepseek-v2-lite; CI will be added by @Potabk.
Please update the model UT after this PR is merged, thanks! cc @Potabk
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove `supports_structured_output()` from the platform. This method is no longer needed because it has been removed upstream.
Signed-off-by: shen-shanshan <467638484@qq.com>
This PR adds a patch module for vLLM:
1. platform patch: registered when the platform is loaded
2. worker patch: registered when the worker is started
In detail:
1. patch_common: patches for both the main and 0.8.4 versions
2. patch_main: patches for the main version only
3. patch_0_8_4: patches for the 0.8.4 version only
### What this PR does / why we need it?
Adapt the Disaggregated Prefill feature to Ascend devices.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
A test usage example is provided along with this PR in
examples/offline_disaggregated_prefill_npu.py. To run it:
```bash
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```
---------
Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
### What this PR does / why we need it?
This PR enables the custom ops build by default.
### Does this PR introduce _any_ user-facing change?
Yes. Installing vllm-ascend from source will now trigger the custom ops
build step.
### How was this patch tested?
By image build and e2e CI
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.
Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.
Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements.
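As a rough illustration of the idea (the environment variable name and helper below are hypothetical, not the PR's actual code):
```python
import os
import torch

# Hypothetical env var controlling the pre-constructed mask size.
PRE_MASK_LEN = int(os.getenv("HYPOTHETICAL_ATTN_MASK_LEN", "2048"))

# Built once at startup: upper-triangular -inf mask for causal attention.
_PRE_MASK = torch.triu(
    torch.full((PRE_MASK_LEN, PRE_MASK_LEN), float("-inf")), diagonal=1
)

def get_attn_mask(seq_len: int) -> torch.Tensor:
    if seq_len <= PRE_MASK_LEN:
        # Fast path: slicing the cached matrix avoids rebuilding it per request.
        return _PRE_MASK[:seq_len, :seq_len]
    # Slow path: a request longer than the cache forces a rebuild, which is
    # exactly the degradation the env var lets users size around.
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )
```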
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
Fix CI by updating mypy and pinning the numpy version.
_the modification of model_runner_v1 is just to make CI happy_
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
Signed-off-by: MengqingCao <cmq0113@163.com>
1. Remove useless code in attention.py.
2. Multi-step now uses StatefulModelInputForNPU instead of
StatefulModelInput.
Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
### What this PR does / why we need it?
We propose the FastPatch method, which optimizes the patch embedding
(Conv3D) layer for Qwen2VL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We tested it on our benchmark; the results are satisfactory and outperform
the original patch_embed layer.
---------
Signed-off-by: baifanxxx <baifanxxx@gmail.com>
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
1. Doc: fix broken links
2. Doc: make the Chinese version consistent with the English version
3. Remove useless file `test.py`
4. Update `collect_env.py`
5. Fix V1 import error
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add support for V1 Engine.
Please note that this is just the initial version; there may be places that
need to be fixed or optimized in the future, so feel free to leave comments
for us.
### Does this PR introduce _any_ user-facing change?
To use the V1 Engine on an NPU device, set the environment variables shown
below:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
If you are using vLLM for offline inference, you must add a `__main__`
guard like:
```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```
Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
Query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
It fixes the following bugs:
1. When searching for a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear layers is required
to identify the corresponding quant type.
2. The exception is narrowed down to ImportError when importing
MindIETurboQuantizer, so that other errors are raised properly.
3. The API of AscendKVCacheMethod.apply is aligned with that of
AscendAttentionBackendImpl.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By performing offline inference.
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
Support encoder-only attention with torch SDPA.
Fixes
https://github.com/vllm-project/vllm-ascend/pull/229#issuecomment-2695942741
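For context, encoder-only (bidirectional) attention maps naturally onto torch SDPA. A minimal sketch follows; the tensor shapes and padding-mask handling are assumptions for illustration, not this PR's backend code:
```python
import torch
import torch.nn.functional as F

def encoder_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      padding_mask: torch.Tensor | None = None) -> torch.Tensor:
    # q, k, v: (batch, num_heads, seq_len, head_dim); no causal mask, since
    # every token in an encoder-only model may attend to every other token.
    attn_mask = None
    if padding_mask is not None:
        # padding_mask: (batch, seq_len) bool, True for real (non-pad) tokens;
        # broadcast it over heads and query positions.
        attn_mask = padding_mask[:, None, None, :]
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=attn_mask, is_causal=False)
```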
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally with `pytest
vllm-project/vllm/tests/entrypoints/openai/test_score.py`.
**Note**: Since torch compile on NPU is still a work in progress, we need
to comment out the following code to make the UT run:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L138
result:
```bash
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/xxx/code/vllm-cpu/vllm
configfile: pyproject.toml
plugins: shard-0.1.2, rerunfailures-15.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 8 items
Running 8 items in this shard
tests/entrypoints/openai/test_score.py ........ [100%]
==================================================================================== warnings summary ====================================================================================
../../../miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================== 8 passed, 1 warning in 131.42s (0:02:11) ========================================================================
```
This UT will be included in CI once the torch compile feature is complete.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR fixes the error raised when running inference with Qwen2_VL.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
We tested it on our benchmark; the results are satisfactory and match the
GPU results.
---------
Signed-off-by: zouyida <zouyida@huawei.com>
### What this PR does / why we need it?
Enable Expert Parallel (EP) for Ascend devices.
### Does this PR introduce _any_ user-facing change?
Yes. To enable EP, add `enable_expert_parallel=True` to your offline
inference scripts, like this:
```python
llm = LLM(
    model="/path/to/model",
    trust_remote_code=True,
    tensor_parallel_size=4,
    max_model_len=4096,
    enforce_eager=True,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
)
```
### How was this patch tested?
Please use the `main` branch of vLLM.
---------
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
This PR changes the initial value of the block size back to 128 and adds a
hash of the request id list in the model runner to implement a sampling
param cache in the sampler.
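A minimal sketch of the request-id-hash idea; the class and field names below are assumptions, not the model runner's actual code:
```python
from typing import Callable, Optional

class SamplingParamCache:
    """Reuse prepared sampling params while the batch composition is unchanged."""

    def __init__(self) -> None:
        self._last_batch_hash: Optional[int] = None
        self._cached = None

    def get(self, request_ids: list[str], build_fn: Callable):
        # Order-sensitive hash of the current batch's request ids.
        batch_hash = hash(tuple(request_ids))
        if self._cached is None or batch_hash != self._last_batch_hash:
            self._cached = build_fn()          # rebuild only when the batch changed
            self._last_batch_hash = batch_hash
        return self._cached
```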
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
1. It adds more descriptions for the classes in quant_config.py.
2. It renames AscendQKVQuantAttentionMethod to AscendKVCacheMethod to
align with the vLLM naming style.
3. It modifies the process when AscendLinearMethod or
AscendKVCacheMethod calls create_weights.
### Does this PR introduce _any_ user-facing change?
Yes. When creating weights, AscendLinearMethod now uses the get_weight,
get_pertensor_param and get_perchannel_param APIs from the linear quant
implementation, while AscendKVCacheMethod passes the layer into the linear
quant implementation.
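A rough sketch of the new call pattern described above; the method signatures and the parameter-registration details are assumptions, only the get_weight / get_pertensor_param / get_perchannel_param usage comes from this description:
```python
import torch

def create_weights(layer: torch.nn.Module, quant_impl,
                   input_size: int, output_size: int) -> None:
    # Main quantized weight; its dtype/shape are decided by the quant implementation.
    weight = quant_impl.get_weight(input_size, output_size)
    layer.register_parameter(
        "weight", torch.nn.Parameter(weight, requires_grad=False))

    # Per-tensor params, e.g. a single scale/offset for the whole tensor.
    for name, value in quant_impl.get_pertensor_param().items():
        layer.register_parameter(
            name, torch.nn.Parameter(value, requires_grad=False))

    # Per-channel params, e.g. one scale per output channel.
    for name, value in quant_impl.get_perchannel_param(output_size).items():
        layer.register_parameter(
            name, torch.nn.Parameter(value, requires_grad=False))
```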
### How was this patch tested?
By performing offline inference
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
This PR changes the shape of the KV cache to avoid creating views of
k_cache and v_cache. In addition, it caches the metadata of k_cache and
v_cache to avoid duplicate slice operations and improve performance.
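A minimal sketch of the caching idea under assumed names and layout; the actual cache shape used by this PR may differ:
```python
import torch

class KVCacheRef:
    """Split the KV cache once and keep the references, instead of slicing
    kv_cache into k_cache / v_cache on every attention call."""

    def __init__(self, kv_cache: torch.Tensor) -> None:
        # Assumed layout: (2, num_blocks, block_size, num_kv_heads, head_size).
        self.key_cache = kv_cache[0]
        self.value_cache = kv_cache[1]

    def get(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Reuse the cached references; no new view or slice per step.
        return self.key_cache, self.value_cache
```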
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Remove redundant `profile_run()` in model runner.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
---------
Signed-off-by: Shanshan Shen <467638484@qq.com>
This PR adds pooling support for vllm-ascend.
Tested with `bge-base-en-v1.5` via `encode`:
```python
from vllm import LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embeddings. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats
```
Tested with the `embed` task:
```python
from vllm import LLM, SamplingParams
llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
Related: https://github.com/vllm-project/vllm-ascend/issues/200
## Known issue
The accuracy is not correct yet, since this feature relies on `enc-dec`
support. That will be addressed in a follow-up PR by @MengqingCao.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The rank returned by `torch.distributed.get_rank(device_group)` is the
local rank, but rank (or rank in process group (PG)) is expected.
Thus we change to use `torch.npu.current_device()` to set the device:
```python
# difference between `local_rank` and `rank_in_group`:
# if we have a group of size 4 across two nodes:
# Process | Node | Rank | Local Rank | Rank in Group
# 0 | 0 | 0 | 0 | 0
# 1 | 0 | 1 | 1 | 1
# 2 | 1 | 2 | 0 | 2
# 3 | 1 | 3 | 1 | 3
```
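A minimal sketch of the new device selection, assuming torch_npu is installed and the NPU device was already bound earlier in worker initialization:
```python
import torch
import torch_npu  # noqa: F401  registers the torch.npu device backend

def current_npu_device() -> torch.device:
    # Use the device index already bound to this process instead of deriving it
    # from torch.distributed.get_rank(device_group), which (per the table above)
    # does not give the value this code path needs on multi-node setups.
    return torch.device(f"npu:{torch.npu.current_device()}")
```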
Tested by @wwfu109 with
`vllm/tests/distributed/test_customops::test_multi_process_tensor_parallel_pipeline_parallel`
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
To avoid unnecessary delays, we only import torch_npu when profiling is
enabled.
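A minimal sketch of the lazy-import pattern, assuming vLLM's standard `VLLM_TORCH_PROFILER_DIR` flag gates profiling; the profiler construction below is illustrative, not this PR's exact code:
```python
import os

def maybe_create_profiler():
    trace_dir = os.getenv("VLLM_TORCH_PROFILER_DIR")
    if not trace_dir:
        # Profiling disabled: never import torch_npu on this path, so normal
        # startup does not pay the import cost.
        return None
    import torch_npu  # imported lazily, only when profiling is requested
    return torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.NPU],
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(trace_dir),
    )
```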
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Fix CUDA hard-coding in the model runner.
2. Fix the tutorials doc rendering error.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Remove unused assertion in `NPUWorker`, as this has been moved to
`Executor` in vLLM:
aabeb2688f/vllm/executor/uniproc_executor.py (L43)
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove padding for VLM inputs.
We don't need padded inputs now, and this padding breaks the input
preparation of VLMs.
### Does this PR introduce _any_ user-facing change?
N/A
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
The argument list of `Attention.forward()` was changed by
https://github.com/vllm-project/vllm/pull/13555.
The unused args `kv_caches` and `attn_metadata` are removed accordingly.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Eliminate redundant operations in the code to improve performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: Yaphets24 <d_mym0618@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>