This is similar to the CUDA kernel from #16130, but it doesn't use shared memory and it handles different subgroup sizes.