xc-llm-ascend

Author	SHA1	Message	Date
Pleaplusone	ce8259975e	[core] Support custom ascendc kernels in vllm-ascend (#233 ) This PR adds custom ascendc kernel rotary_embedding support in vllm-ascend; the related CMakeLists and setuptools changes are also included in this PR. Related: https://github.com/vllm-project/vllm-ascend/issues/156 --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-03 14:52:34 +08:00
Shanshan Shen	14d9a64047	[ModelRunner][V1] Optimize V1 attention mask (#442 ) ### What this PR does / why we need it? Pre-construct a mask matrix to improve the efficiency of attention mask construction during inference. Note that the length of the matrix needs to be carefully balanced: a matrix that is too large will consume excessive VRAM, while a matrix that is too small will require dynamic concatenation during inference, leading to performance degradation. Therefore, an environment variable is added here to dynamically set the size of the pre-constructed mask matrix based on requirements. --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: didongli182 <didongli@huawei.com>	2025-04-02 10:33:53 +08:00
hfadzxy	94bf9c379e	[Doc]Add developer guide for using lm-eval (#456 ) ### What this PR does / why we need it? Add developer guide for using lm-eval ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test manually --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-01 23:43:51 +08:00
dependabot[bot]	78083d405e	Bump actions/setup-python from 5.4.0 to 5.5.0 (#440 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.4.0 to 5.5.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-04-01 14:34:33 +08:00
Mengqing Cao	2dbd763584	[CI] Fix mypy CI (#443 ) ### What this PR does / why we need it? Fix CI by updating mypy and pinning numpy version _the modification of model_runner_v1 is just to make CI happy_ ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-01 09:25:33 +08:00
Yikun Jiang	c42e21a5aa	[Docs] Add install system dependencies in install doc (#438 ) ### What this PR does / why we need it? Add install system dependencies in install doc Resolve: ``` $ pip install vllm==v0.7.3 CMake Error at CMakeLists.txt:14 (project): No CMAKE_CXX_COMPILER could be found. Tell CMake where to find the compiler by setting either the environment variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path to the compiler, or to the compiler name if it is in the PATH. // ... ... note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for vllm Failed to build vllm ERROR: Failed to build installable wheels for some pyproject.toml based projects (vllm) ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/439 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-31 14:17:55 +08:00
hfadzxy	7beb4339dc	[Doc]Add developer guide for using OpenCompass (#368 ) ### What this PR does / why we need it? Add developer guide for using OpenCompass ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test manually --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-31 00:24:25 +08:00
wangxiyuan	b6499ed97d	[CI] Use CI pool (#428 ) Use CI pool instead of self-host for e2e test to speed up CI. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-29 12:42:59 +08:00
wangxiyuan	ca8b1c3e47	[Doc] Add 0.7.3rc2 release note (#419 ) Add 0.7.3rc2 release note. We'll release 0.7.3rc2 right now. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-29 09:02:08 +08:00
wangxiyuan	31f29b9f30	[Core] Make V1 work and enable V1 engine test (#389 ) 1. Make sure the version is string before parse in collect_env 2. Add basic V1 engine test Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-28 19:34:23 +08:00
wuhuikx	57a84bb7be	[Bug Fix] Fix bug of platform for parameter checking (#411 ) Fix bug in platform.py to avoid the None value of config parameters. Signed-off-by: wuhuikx <wuhui_csu@163.com>	2025-03-28 16:31:27 +08:00
Tony	b1557abab6	fix multistep bug, remove useless code (#355 ) 1. remove useless code in attention.py 2. multistep now uses StatefulModelInputForNPU instead of StatefulModelInput Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>	2025-03-28 09:55:35 +08:00
Yikun Jiang	1864c40520	Add vLLM Ascend Weekly meeting link (#400 ) ### What this PR does / why we need it? Add vLLM Ascend Weekly meeting link ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-27 09:00:21 +08:00
Zhenyu Zheng	4804b74e95	Update 110-user-story.yml (#402 ) Fix a few typos in issue template Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>	2025-03-27 08:58:57 +08:00
Zhenyu Zheng	0b5a9643fd	Add an example for user stories (#399 ) Add an example for user stories and fix some typos. Add a new section, User Stories, to the docs to collect user stories of vllm-ascend; also add an example and an issue template for collecting user stories. Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>	2025-03-26 16:25:57 +08:00
BAI Fan	122505208f	FastPatch: Optimized Patch Embedding for Qwen2VL (#345 ) ### What this PR does / why we need it? We proposed the FastPatch method, which optimized patch embedding (Conv3D) for Qwen2VL. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We've tested it on benchmark, it meets our satisfaction and is better than original patch_embed layer. --------- Signed-off-by: baifanxxx <baifanxxx@gmail.com> Signed-off-by: zouyida <zouyida@huawei.com> Co-authored-by: zouyida <zouyida@huawei.com>	2025-03-26 14:28:20 +08:00
Mengqing Cao	d4accf4ec2	[Doc][Model] update LLaVA 1.6 support (#373 ) update LLaVA 1.6 support --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-26 09:07:55 +08:00
Mengqing Cao	6295d2e9bc	[CI/Build][Doc] upgrade torch-npu to 0320 (#392 ) ### What this PR does / why we need it? This pr upgrades torch-npu to 0320, so that #321, https://github.com/vllm-project/vllm-ascend/issues/267#issuecomment-2745045743 could be fixed, and #372 should be reverted after this pr ### Does this PR introduce _any_ user-facing change? upgrade torch-npu to 0320 ### How was this patch tested? tested locally with long seq inferencing. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-26 09:04:12 +08:00
Shanshan Shen	3fb3b5cf75	[Doc] Update model support doc (add QwQ-32B) (#388 ) ### What this PR does / why we need it? Update model support doc (add QwQ-32B) Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-03-25 11:40:50 +08:00
Mengqing Cao	8996733307	[CI] fix vllm test (#365 ) fix vllm test Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-24 16:09:06 +08:00
Shanshan Shen	89ca63a2c2	[Bugfix] Disable torch.compile() (#370 ) ### What this PR does / why we need it? To resolve this [patch](https://github.com/vllm-project/vllm-ascend/pull/236/files#diff-43b96b39b5a52fe209d86449ad703a7ff5e1349ebaf1aa12ece8d82163ee5b61R24-R49) , we need to set `torch.compile()` backend to `eager` to disable compile, using default pytorch way. --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-03-21 15:55:51 +08:00
Li Wang	9a175ca0fc	[Doc]Add benchmark scripts (#74 ) ### What this PR does / why we need it? The purpose of this PR is to add benchmark scripts for npu, developers can easily run performance tests on their own machines with one line of code . --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-03-21 15:54:34 +08:00
wangxiyuan	befbee5883	Update README and add collect_env info (#369 ) 1. Doc: fix wrong link 2. Doc: make the Chinese version consistent with the English one 3. remove useless file `test.py` 4. update `collect_env.py` 5. Fix v1 import error Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-21 15:43:43 +08:00
Yikun Jiang	243ed4da69	Add vLLM forum info and update readme (#366 ) ### What this PR does / why we need it? Add vLLM forum info and update readme ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-21 09:32:42 +08:00
Shanshan Shen	c06af8b2e0	[V1][Core] Add support for V1 Engine (#295 ) ### What this PR does / why we need it? Add support for V1 Engine. Please note that this is just the initial version, and there may be some places need to be fixed or optimized in the future, feel free to leave some comments to us. ### Does this PR introduce _any_ user-facing change? To use V1 Engine on NPU device, you need to set the env variable shown below: ```bash export VLLM_USE_V1=1 export VLLM_WORKER_MULTIPROC_METHOD=spawn ``` If you are using vllm for offline inferencing, you must add a `__main__` guard like: ```bash if __name__ == '__main__': llm = vllm.LLM(...) ``` Find more details [here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing). ### How was this patch tested? I have tested the online serving with `Qwen2.5-7B-Instruct` using this command: ```bash vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240 ``` Query the model with input prompts: ```bash curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", "max_tokens": 7, "temperature": 0 }' ``` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: didongli182 <didongli@huawei.com>	2025-03-20 19:34:44 +08:00
wangxiyuan	663dca7578	[CI] fix race condition problem (#353 ) fix race condition problem Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-19 17:04:36 +08:00
Shanshan Shen	441a62e937	[Doc] Fix bugs of installation doc and format tool (#330 ) ### What this PR does / why we need it? Fix bugs of installation doc and format tool. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-03-14 10:21:35 +08:00
wangxiyuan	ac1ba1d8d2	[Build] Fix x86 image build (#327 ) Install cpu version of pytorch in x86 to reduce image size Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-14 09:41:57 +08:00
wangxiyuan	c25631ec7b	[Doc] Add the release note for 0.7.3rc1 (#285 ) Add the release note for 0.7.3rc1 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-13 17:57:06 +08:00
Li Wang	41aba1cfc1	[Doc]Fix tutorial doc expression (#319 ) Fix tutorial doc expression Signed-off-by: wangli <wangli858794774@gmail.com>	2025-03-13 15:24:05 +08:00
xiemingda	59ea23d0d3	[Doc] Add Single NPU (Qwen2.5-VL-7B) tutorial (#311 ) Run vllm-ascend on Single NPU What this PR does / why we need it? Add vllm-ascend tutorial doc for Qwen/Qwen2.5-VL-7B-Instruct model Inference/Serving doc Does this PR introduce any user-facing change? no How was this patch tested? no Signed-off-by: xiemingda <xiemingda1002@gmail.com>	2025-03-12 20:37:12 +08:00
Angazenn	7330416de3	[BugFix] Fix bugs when using ascend quantization (#275 ) ### What this PR does / why we need it? It fixes the following bugs: 1. When searching for a specific linear quantization implementation from a tool (such as MindIE-Turbo), the mapping of packed linear is required to identify the corresponding quant type. 2. The exception caught when importing MindIETurboQuantizer is narrowed down to ImportError so that other errors are raised properly. 3. The API of AscendKVCacheMethod.apply is aligned with that in AscendAttentionBackendImpl. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By performing offline inference: ![image](https://github.com/user-attachments/assets/d63804cf-c060-451f-9cb0-d012e06b5333) --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-03-12 11:33:21 +08:00
Mengqing Cao	5c7a95b01d	[Attn] Support encoder-only attention with torch sdpa (#290 ) ### What this PR does / why we need it? Support encoder-only attention with torch sdpa fix https://github.com/vllm-project/vllm-ascend/pull/229#issuecomment-2695942741 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Test locally with `pytest vllm-project/vllm/tests/entrypoints/openai/test_score.py` Note: since torch compile on NPU is still a work in progress, the following code must be commented out to make the UT run: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L138 result: ```bash /home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) ================================================================================== test session starts =================================================================================== platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 rootdir: /home/xxx/code/vllm-cpu/vllm configfile: pyproject.toml plugins: shard-0.1.2, rerunfailures-15.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None collected 8 items Running 8 items in this shard tests/entrypoints/openai/test_score.py ........ [100%] ==================================================================================== warnings summary ==================================================================================== ../../../miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8 /home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import pkg_resources -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================================================== 8 passed, 1 warning in 131.42s (0:02:11) ======================================================================== ``` This UT will be included in CI once the torch compile feature is done. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-12 08:57:29 +08:00
zouyida2002	12aa7115b5	bugfix for qwen2_vl (#301 ) ### What this PR does / why we need it? this pr fixes the error while inferring Qwen2_VL. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? We've tested it on benchmark, it meets our satisfaction and is equal to gpu. --------- Signed-off-by: zouyida <zouyida@huawei.com>	2025-03-12 08:39:50 +08:00
wangxiyuan	9450e9811b	[CI] Uninstall triton in dockerfile (#298 ) triton doesn't work with ascend. We should make sure it's uninstalled in dockerfile Related: https://github.com/vllm-project/vllm-ascend/issues/291 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-12 07:14:57 +08:00
yiz-liu	0db6670bfa	[Feature] Implement EP-compatible fused_moe (#121 ) ### What this PR does / why we need it? Enable Expert-Parallel for ascend devices. ### Does this PR introduce _any_ user-facing change? To enable EP, add `enable_expert_parallel=True` in your offline inference scripts, like this: ```python llm = LLM( model="/path/to/model", trust_remote_code=True, tensor_parallel_size=4, max_model_len=4096, enforce_eager=True, distributed_executor_backend="mp", enable_expert_parallel=True, ) ``` ### How was this patch tested? Please use the `main` branch of vLLM. --------- Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>	2025-03-11 21:08:02 +08:00
Tony	4c9d78a035	support multistep decode (#299 ) Add multi step scheduler support for vllm-ascend Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>	2025-03-11 19:20:06 +08:00
whx	feb6bdb12e	[Platform][Model Runner] Add hash of request_ids; Change blocksize back to 128. (#293 ) This PR changes the initial value of blocksize back to 128 and adds hash value of request id list in model runner for implementing sampling param cache in sampler. Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-03-11 18:50:28 +08:00
Yikun Jiang	007aeaa48b	[Doc] Change distributed_executor_backend to mp (#287 ) ### What this PR does / why we need it? Fix `ValueError: Unrecognized distributed executor backend tp. Supported values are 'ray', 'mp' 'uni', 'external_launcher' or custom ExecutorBase subclass.` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test on my local node Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-10 11:27:26 +08:00
Yikun Jiang	38334f5daa	[Docs] Re-arch on doc and make QwQ doc work (#271 ) ### What this PR does / why we need it? Re-arch on tutorials, move single npu / multi npu / multi node to index. - Unify docker run cmd - Use dropdown to hide build from source installation doc - Re-arch tutorials to include Qwen/QwQ/DeepSeek - Make QwQ doc work ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI test Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-10 09:27:48 +08:00
Yikun Jiang	18bb8d1f52	Adapt vLLM requirements changes to fix main CI (#279 ) ### What this PR does / why we need it? Adapt vLLM requirements changes: `206e2577fa (diff-01ec17406c969585ed075609a2bbf2f2f4fe3e3def36946694abe6d4eb60a6f2)` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-09 16:07:45 +08:00
Yikun Jiang	268da28961	Pin modelscope<1.23.0 on vLLM v0.7.3 (#272 ) ### What this PR does / why we need it? Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-09 15:59:42 +08:00
Yikun Jiang	be58d5f3d8	Bump torch_npu version to dev20250308.3 (#276 ) ### What this PR does / why we need it? Bump torch_npu version to dev20250308.3 to fix performance regression on multi-stream case: `e04c580d07` . ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-09 15:59:15 +08:00
Mengqing Cao	91f7d8115d	[CI/Build] Bump torch_npu to dev20250307.3 (#265 ) Update torch-npu version to fix the accuracy of torch npu exponential_. With this update, the precision issue when setting `temperature > 0` is fixed. --------- Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-03-07 20:34:07 +08:00
zouyida2002	faf8cd89cb	register qwen2_vl to rewrite qwen2_vl forward (#241 ) Add qwen2-vl ascend implementation. --------- Signed-off-by: zouyida <zouyida@huawei.com>	2025-03-07 15:41:47 +08:00
Yikun Jiang	35cb7b5234	[CI] Add dispatch job to leverage dynamic devices (#251 ) ### What this PR does / why we need it? Add a dispatch job that assigns jobs to dynamically allocated devices in 2 stages, as below. The dispatch job adds roughly `10s * parallel number + 30s` of overhead while it waits for other jobs to launch their containers and release the lock. - Stage 1: Acquire lock. Add a dispatch job; this job uses lockfile to acquire locks and then gets the device number dynamically - Stage 2.1: Launch container with dynamic device. Pass the device number via output and start the container job with the dynamic device - Stage 2.2: Release lock. Once the job has started, release the lock. In the backend, we use multiple paths to set up multiple self-hosted runners as a load balancer: ``` $ pwd /home/action $ ll | grep actions drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-01 drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-02 drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-03 drwx------ 6 action action 4096 Mar 7 08:56 actions-runner-04 drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-05 drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-06 ``` ``` adduser -G docker action su action pip3 install docker prettytable sudo yum install procmail ``` ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - CI passed - E2E test manually, triggered 3 jobs in parallel: - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2. - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3 - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-07 09:47:13 +08:00
Angazenn	3217f0d10f	[Feature] Modify description and api for ascend quantization (#243 ) ### What this PR does / why we need it? 1. It adds more description for classes in quant_config.py 2. It renames AscendQKVQuantAttentionMethod to AscendKVCacheMethod to align with vLLM naming style. 3. It modifies the process when AscendLinearMethod or AscendKVCacheMethod calls create_weights. ### Does this PR introduce _any_ user-facing change? Yes. When creating weights, now AscendLinearMethod uses get_weight, get_pertensor_param and get_perchannel_param api from linear quant implementation, while AscendKVCacheMethod passes layer into linear quant implementation. ### How was this patch tested? By performing offline inference --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-03-06 15:17:25 +08:00
Yikun Jiang	cff08f9df8	[Doc] Add initial FAQs (#247 ) ### What this PR does / why we need it? Add initial FAQs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-06 10:42:42 +08:00
HongtaoYang	dcd0005058	[Fix] Remove npu_group_topk before CANN version update (#242 ) Remove npu_group_topk before CANN version update. Signed-off-by: SidaoY <1024863041@qq.com>	2025-03-06 09:02:46 +08:00
whx	0d3463400a	[Performance] Change the shape of kv_cache to avoid view of k_cache and v_cache. (#204 ) This PR changes the shape of kv cache to avoid the view of k_cache and v_cache. What's more, cache the metadata of k_cache and v_cache to avoid duplicative slice operations to improve performance. Signed-off-by: hw_whx <wanghexiang7@huawei.com>	2025-03-05 10:51:07 +08:00
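
Commit 14d9a64047 (#442) describes pre-constructing an attention mask once and sizing it with an environment variable, trading memory for the cost of building the mask on the fly. The following is a minimal pure-Python sketch of that idea, not the vllm-ascend implementation: `CachedCausalMask`, the variable name `EXAMPLE_ATTN_MASK_LEN` and the default length of 128 are hypothetical, and real code would hold NPU tensors rather than nested lists.

```python
import os

# Hypothetical variable name; the commit does not say which variable vllm-ascend reads.
MASK_LEN_ENV = "EXAMPLE_ATTN_MASK_LEN"
DEFAULT_MASK_LEN = 128  # hypothetical default size
NEG_INF = float("-inf")


def build_causal_mask(size):
    """Causal mask: 0.0 where a query may attend, -inf for future positions."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]


class CachedCausalMask:
    """Build the mask once up front and slice it for each request."""

    def __init__(self, max_len=None):
        if max_len is None:
            max_len = int(os.environ.get(MASK_LEN_ENV, DEFAULT_MASK_LEN))
        self._mask = build_causal_mask(max_len)
        self.grow_count = 0  # how often the slow path was taken

    @property
    def cached_len(self):
        return len(self._mask)

    def _grow(self, new_len):
        old_len = self.cached_len
        # Existing rows only gain future columns, all of which are masked.
        for row in self._mask:
            row.extend([NEG_INF] * (new_len - old_len))
        # Append the new rows at the larger width.
        for row in range(old_len, new_len):
            self._mask.append(
                [0.0 if col <= row else NEG_INF for col in range(new_len)])
        self.grow_count += 1

    def get(self, seq_len):
        if seq_len > self.cached_len:
            # Slow path: the preset size was too small, so extend the cache.
            self._grow(seq_len)
        # Fast path: copy the top-left seq_len x seq_len block.
        return [row[:seq_len] for row in self._mask[:seq_len]]


if __name__ == "__main__":
    masks = CachedCausalMask(max_len=4)
    print(masks.get(3))
    masks.get(6)
    print(masks.cached_len, masks.grow_count)
```

A larger preset length keeps long prompts on the fast slicing path at the cost of memory held for the whole run; a smaller one saves memory but sends long prompts through the growth step, which is the balance the commit message describes.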


1275 Commits