CUDA: optimize MMQ int8 tensor core performance (#8062)

* CUDA: optimize MMQ int8 tensor core performance

* only a single get_mma_tile_x_k function

* simplify code, make functions constexpr

Johannes Gäßler

2024-06-24 12:41:23 +02:00

committed by GitHub

parent 52fc8705a0

commit 9a590c8226

3 changed files with 902 additions and 570 deletions

File diff suppressed because it is too large.