CUDA: optimize MMQ int8 tensor core performance (#8062)

* CUDA: optimize MMQ int8 tensor core performance

* only a single get_mma_tile_x_k function

* simplify code, make functions constexpr
Johannes Gäßler
2024-06-24 12:41:23 +02:00
committed by GitHub
parent 52fc8705a0
commit 9a590c8226
3 changed files with 902 additions and 570 deletions

File diff suppressed because it is too large