Support loading of larger models with on-the-fly quantization (#3061)
@@ -317,6 +317,7 @@ class ServerArgs:
                 "dummy",
                 "gguf",
                 "bitsandbytes",
+                "layered",
             ],
             help="The format of the model weights to load. "
             '"auto" will try to load the weights in the safetensors format '
@@ -330,7 +331,10 @@ class ServerArgs:
             "which is mainly for profiling."
             '"gguf" will load the weights in the gguf format. '
             '"bitsandbytes" will load the weights using bitsandbytes '
-            "quantization.",
+            "quantization."
+            '"layered" loads weights layer by layer so that one can quantize a '
+            "layer before loading another to make the peak memory envelope "
+            "smaller.",
         )
         parser.add_argument(
             "--trust-remote-code",