* run one test per backend/device (even if it's the same device)
* test-thread-safety : each context uses a single sequence
* embedding : handle --parallel argument
  ggml-ci
* save-load : handle -np 1
  ggml-ci
* thread-safety : avoid overriding threads, reduce test case arg
  ggml-ci
* llama : add thread safety test
* llamafile : remove global state
* llama : better LLAMA_SPLIT_MODE_NONE logic
  when main_gpu < 0 GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>