Yibo Cai
5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR speeds up the q6_k_q8_k gemm kernel using the arm64 i8mm instructions.
Tested on neoverse-n2 with a llama3 8b q6_k quantized model.
- 39% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
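For readers unfamiliar with i8mm: its core instruction, SMMLA (exposed in C as the `vmmlaq_s32` intrinsic), multiplies a 2x8 block of int8 values by the transpose of another 2x8 int8 block and accumulates into a 2x2 int32 block. The scalar sketch below only models that arithmetic; `smmla_ref` is a hypothetical helper name, and this is not the kernel code from the PR.

```c
#include <stdint.h>

// Scalar reference for one SMMLA step (hypothetical helper, not from the PR).
// acc: 2x2 int32, row-major. a, b: 2x8 int8, row-major.
// Computes acc += a * transpose(b).
static void smmla_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; ++i) {          // row of a
        for (int j = 0; j < 2; ++j) {      // row of b (column of transpose(b))
            int32_t sum = 0;
            for (int k = 0; k < 8; ++k) {  // 8-element int8 dot product
                sum += (int32_t)a[8 * i + k] * (int32_t)b[8 * j + k];
            }
            acc[2 * i + j] += sum;
        }
    }
}
```

One SMMLA performs 32 multiply-accumulates, against 16 for a single 128-bit SDOT, which is a general property of the instruction and not a statement about how the PR's kernel is structured.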
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
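The uplift figures quoted in the message can be re-derived from the table as (this pr / original - 1) * 100. A minimal sketch, with the table values copied into arrays; the array names and `uplift_pct` are hypothetical and exist only for this illustration:

```c
// Throughput in tokens/s from the table above, indexed by batch size
// 1, 2, 4, 8, 16, 32 (array names are hypothetical).
static const double s_pp_orig[6] = {78.52, 84.62, 84.36, 90.52, 90.11, 89.81};
static const double s_pp_new[6]  = {109.18, 123.94, 122.49, 138.87, 138.56, 137.79};
static const double s_tg_orig[6] = {18.63, 34.54, 52.65, 63.46, 71.04, 75.14};
static const double s_tg_new[6]  = {18.88, 36.92, 61.32, 84.41, 101.33, 110.47};

// Relative uplift of `after` over `before`, in percent.
static double uplift_pct(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

Calling `uplift_pct(s_tg_orig[i], s_tg_new[i])` for `i` from 2 to 5 covers the "batch size 4 and above" rows.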
2025-05-14 21:53:52 +02:00