rms_norm_mul
* opencl: add fused `rms_norm` + `mul`
* opencl: improve workgroup size for `rms_norm_mul`