* CUDA: mul_mat_v support for batch sizes > 1
* use 64-bit math for initial offset calculation
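
For context on the 64-bit offset change, here is a minimal host-side sketch of why the width of the arithmetic matters. The function names `offset_32` and `offset_64` and the example values are hypothetical and are not taken from the ggml sources; the sketch only assumes that an initial offset is formed as an index multiplied by a stride.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of ggml.

// The multiplication is carried out in 32 bits, so the product wraps
// modulo 2^32 before it is widened to 64 bits.
static int64_t offset_32(uint32_t index, uint32_t stride) {
    const uint32_t wrapped = index * stride;
    return static_cast<int64_t>(wrapped);
}

// Widening one operand first makes the multiplication itself 64-bit,
// so the full product is preserved.
static int64_t offset_64(uint32_t index, uint32_t stride) {
    return static_cast<int64_t>(index) * stride;
}
```

With `index = 70000` and `stride = 70000`, the true product 4,900,000,000 exceeds 2^32, so `offset_32` yields a wrapped value while `offset_64` yields the full product. Larger batch sizes make products of this size more likely, which is one plausible reason to do the initial offset calculation in 64 bits.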
ggml_roll