* ci : faster CUDA toolkit installation method and use ccache * remove fetch-depth * only pack CUDA runtime on master