Yibo Cai
54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR speeds up the q4_k_q8_k GEMM kernel using the arm64 i8mm (int8 matrix multiply) instructions.
Tested on Neoverse-N2 with a Llama 3 8B model quantized to Q4_K_M.
- 34% to 50% S_PP (prompt processing speed) uplift for all batch sizes
- 12% to 37% S_TG (text generation speed) uplift for batch size 4 and above; batch sizes 1 and 2 are roughly unchanged
Perplexity doesn't change with this PR.
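For readers unfamiliar with i8mm, the sketch below is a scalar model of what a single SMMLA instruction computes: a 2x8 block of int8 values times the transpose of another 2x8 block, accumulated into a 2x2 block of int32. It is not the kernel from this PR. The function name `smmla_ref` is hypothetical and only illustrates the arithmetic; on hardware the same step is exposed through the `vmmlaq_s32` intrinsic in `arm_neon.h`.

```c
#include <stdint.h>

/* Scalar model of one SMMLA step (illustrative only, not the PR's kernel).
 * a and b each hold 2 rows of 8 int8 values, stored row-major.
 * acc holds a 2x2 int32 tile, also row-major.
 * Computes acc[i][j] += dot(a row i, b row j), i.e. acc += A * B^T. */
void smmla_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t sum = 0;
            for (int k = 0; k < 8; k++) {
                sum += (int32_t)a[8 * i + k] * (int32_t)b[8 * j + k];
            }
            acc[2 * i + j] += sum;
        }
    }
}
```

Each instruction therefore produces four 8-way dot products at once, which is why a GEMM kernel that processes two rows and two columns per step can benefit more than a GEMV-style path with a single row.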
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
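The uplift ranges quoted in the summary can be re-derived from the table above. The following is a small sketch for checking that arithmetic, not part of the PR; the array and function names are hypothetical, and the values are copied from the table in batch-size order (1, 2, 4, 8, 16, 32).

```c
#include <stddef.h>

/* Throughput in t/s, copied from the benchmark table (batch sizes 1..32). */
const double s_pp_orig[6] = {110.12, 121.16, 120.15, 130.97, 131.01, 130.85};
const double s_pp_new[6]  = {147.83, 172.42, 169.75, 196.81, 196.88, 196.51};
const double s_tg_orig[6] = {24.36, 46.36, 74.68, 91.04, 101.43, 106.97};
const double s_tg_new[6]  = {24.28, 47.93, 84.00, 114.74, 135.79, 147.29};

/* Relative speedup of `after` over `before`, in percent. */
double uplift_pct(double before, double after) {
    return (after / before - 1.0) * 100.0;
}

/* Smallest and largest uplift over n paired measurements (n must be >= 1). */
void uplift_range(const double *before, const double *after, size_t n,
                  double *lo, double *hi) {
    *lo = *hi = uplift_pct(before[0], after[0]);
    for (size_t i = 1; i < n; i++) {
        double u = uplift_pct(before[i], after[i]);
        if (u < *lo) *lo = u;
        if (u > *hi) *hi = u;
    }
}
```

Calling `uplift_range(s_pp_orig, s_pp_new, 6, &lo, &hi)` gives roughly 34.2% to 50.3% for S_PP. Calling it on the S_TG arrays starting at index 2 (batch size 4) with `n = 4` gives roughly 12.5% to 37.7%.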
2025-05-29 14:39:20 +03:00