enginex-ascend-910-llama.cpp

EngineX-Ascend/enginex-ascend-910-llama.cpp

Author	SHA1	Message	Date
Jeff Bolz	fb349848f3	vulkan: Handle FA with all -inf mask values (#16447 )	2025-10-20 22:16:08 -05:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Jeff Bolz	94e82c7ead	vulkan: clamp matmul and FA results to the max finite value (#15652 ) * vulkan: clamp matmul and FA results to the max finite value * only clamp for fp16	2025-08-31 08:27:57 +02:00
Jeff Bolz	c9a24fb932	vulkan: Support FA with any multiple of 8 head sizes (#15537 ) The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.	2025-08-24 11:24:25 +02:00
Jeff Bolz	c4f53563df	vulkan: support fattn sinks (#15126 )	2025-08-07 22:44:20 +02:00
Jeff Bolz	a0374a67e2	vulkan: Handle updated FA dim2/3 definition (#14518 ) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-05 09:26:04 +02:00
Jeff Bolz	2b72bedec1	vulkan: support mixed/deepseekR1 FA head sizes (#14509 ) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-03 20:21:14 +02:00
Jeff Bolz	8875523eb3	vulkan: support softmax/FA batch and broadcast (#14449 )	2025-07-02 15:48:33 +03:00
Jeff Bolz	2f5a4e1e09	vulkan: move common FA code to flash_attn_base.comp (#13556 ) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-17 09:14:55 +02:00
Jeff Bolz	015022bb53	vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931 ) The grouped query attention optmization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &. split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.	2025-04-16 20:37:25 +02:00
Jeff Bolz	a4837577aa	vulkan: use aligned loads for flash attention mask (#12853 ) Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, to allow using more efficient loads.	2025-04-12 10:44:48 +02:00
Jeff Bolz	7ecd780b1a	vulkan: Use fp16 for the flash attention P*V multiplication (#12783 ) This is consistent with the ggml-cuda behavior and the mul_mat fallback.	2025-04-09 07:12:57 +02:00
Jeff Bolz	0c74b04376	vulkan: fix NaN issue in flash attention shader (#12776 ) Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.	2025-04-06 11:03:47 +02:00
Jeff Bolz	80b717d493	vulkan: Use unclamped loads for flash attention mask (#12720 ) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-06 10:47:13 +02:00
Jeff Bolz	f01bd02376	vulkan: Implement split_k for coopmat2 flash attention. (#12627 ) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-02 14:25:08 -05:00
Jeff Bolz	be0a0f8cae	vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559 ) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-02 19:40:32 +02:00
Rémy O	fc1b0d0936	vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528 ) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-15 09:01:40 +01:00
Rémy O	8a7e3bf17a	vulkan: initial support for IQ4_XS quantization (#11501 )	2025-02-06 07:09:59 +01:00
Rémy Oudompheng	66ee4f297c	vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360 ) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-01-29 18:29:39 +01:00
Jeff Bolz	aea8ddd516	vulkan: fix coopmat2 validation failures (#11284 ) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert and this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-01-20 10:38:32 -06:00
Jeff Bolz	44e18ef939	vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281 ) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-01-18 09:26:50 +01:00
Jeff Bolz	c9c6e01dae	vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (#10206 )	2024-12-05 20:15:05 +01:00

23 Commits