uvos
ad4a700117
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes ( #14949 )
2025-07-30 17:38:06 +02:00
deepsek
66906cd82a
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 ( #14624 )
...
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3 due to issues in the AMD-provided BLAS libraries.
This commit also improves the awareness of MMQ towards different warp sizes and, as a side effect, improves the performance of all quant formats on GCN GPUs except q4_0 and q4_1, which regress slightly.
2025-07-27 00:28:14 +02:00
Johannes Gäßler
4696d56749
CUDA: fix crash on large batch size for quant. MoE ( #13537 )
2025-05-14 16:41:02 +02:00
Johannes Gäßler
7474e00b34
CUDA: fix crash with partial offloading of MoE ( #13439 )
2025-05-11 16:09:33 +02:00
Johannes Gäßler
2356fb1d53
CUDA: fix bad asserts for partial offload ( #13337 )
2025-05-06 13:58:51 +02:00
Johannes Gäßler
15a28ec8c7
CUDA: fix --split-mode row for MMQ ( #13323 )
2025-05-06 08:36:46 +02:00
Johannes Gäßler
9070365020
CUDA: fix logic for clearing padding with -ngl 0 ( #13320 )
2025-05-05 22:32:13 +02:00
Johannes Gäßler
e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code ( #13199 )
2025-04-30 23:12:59 +02:00
Slobodan Josic
bd40678df7
HIP: Add support for RDNA4 targets ( #12372 )
2025-03-26 23:46:30 +01:00
R0CKSTAR
7ea75035b6
CUDA: Fix clang warnings ( #12540 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-24 11:28:34 +01:00
R0CKSTAR
fac63a3d78
musa: refine compute capability ( #12493 )
...
* musa: refine compute capability
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-22 10:11:37 +01:00
uvos
5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ ( #11831 )
2025-02-12 22:25:28 +01:00
Johannes Gäßler
b9ab0a4d0b
CUDA: use arch list for compatibility check ( #11775 )
...
* CUDA: use arch list for feature availability check
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other ( #11601 )
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly
2025-02-02 22:08:05 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Andreas Kieslinger
750cb3e246
CUDA: rename macros to avoid conflicts with WinAPI ( #10736 )
...
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)
* Reverts erroneous rename in SYCL-code.
* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.
* Renames the rest of the compute capability macros for consistency.
2024-12-10 18:23:24 +01:00
uvos
3ad5451f3b
Add some minimal optimizations for CDNA ( #10498 )
...
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-11-27 17:10:08 +01:00
Johannes Gäßler
c39665f589
CUDA: fix MMQ for non-contiguous src0, add tests ( #10021 )
...
* CUDA: fix MMQ for non-contiguous src0, add tests
* revise test code
2024-10-24 11:09:36 +02:00
Johannes Gäßler
5af118efda
CUDA: fix --split-mode row race condition ( #9413 )
2024-09-11 10:22:40 +02:00
slaren
2b1f616b20
ggml : reduce hash table reset cost ( #8698 )
...
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Johannes Gäßler
69c487f4ed
CUDA: MMQ code deduplication + iquant support ( #8495 )
...
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
2024-07-20 22:25:26 +02:00
Johannes Gäßler
8e558309dc
CUDA: MMQ support for iq4_nl, iq4_xs ( #8278 )
2024-07-05 09:06:31 +02:00
Georgi Gerganov
f3f65429c4
llama : reorganize source code + improve CMake ( #8006 )
...
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122)
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00