[CPU] fix OOM when mem-fraction is not set (#9090)
@@ -84,13 +84,13 @@ git checkout <YOUR-DESIRED-VERSION>
 # Install SGLang dependent libs, and build SGLang main package
 pip install --upgrade pip setuptools
 conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
-pip install intel-openmp
 pip install -e "python[all_cpu]"
+pip install torch==2.7.1 torchvision==0.22.1 triton==3.3.1 --force-reinstall

 # Build the CPU backend kernels
 cd sgl-kernel
 cp pyproject_cpu.toml pyproject.toml
-pip install -v .
+pip install .

 # Other required environment variables
 # Recommend to set these in ~/.bashrc in order not to set every time in a new terminal
@@ -134,13 +134,17 @@ Notes:
 export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
 ```

+Please beware that with SGLANG_CPU_OMP_THREADS_BIND set,
+the available memory amounts of the ranks may not be determined in prior.
+You may need to set proper `--max-total-tokens` to avoid the out-of-memory error.
+
 3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`.
 To specify the maximum batch size when using torch compile, set the flag `--torch-compile-max-bs`.
 For example, `--enable-torch-compile --torch-compile-max-bs 4` means using torch compile and setting the
 maximum batch size to 4.

 4. A warmup step is automatically triggered when the service is started.
-The server is ready when you see the log `The server is fired up and ready to roll!`.
+The server is ready when you see the log `The server is fired up and ready to roll!`.

 ## Benchmarking with Requests

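The note added in the hunk above concerns `SGLANG_CPU_OMP_THREADS_BIND`, whose value lists one core range per tensor-parallel rank. As a minimal sketch (not SGLang's own parser), the following assumes that `|` separates ranks and that each entry is either an inclusive `start-end` range or a single core id. The helper names `parse_omp_threads_bind` and `check_bind_matches_tp` are hypothetical. It can be used to check that the number of groups matches the `--tp` value before launching.

```python
from typing import List


def parse_omp_threads_bind(spec: str) -> List[List[int]]:
    """Expand a binding string such as "0-39|43-82" into per-rank core lists.

    Assumption: "|" separates ranks and "a-b" is an inclusive core range.
    """
    ranks = []
    for group in spec.split("|"):
        group = group.strip()
        if "-" in group:
            start, end = (int(x) for x in group.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {group!r}")
            ranks.append(list(range(start, end + 1)))
        else:
            # A single core id without a range.
            ranks.append([int(group)])
    return ranks


def check_bind_matches_tp(spec: str, tp: int) -> bool:
    """Return True when the string defines exactly one core group per rank."""
    return len(parse_omp_threads_bind(spec)) == tp


if __name__ == "__main__":
    spec = "0-39|43-82|86-125|128-167|171-210|214-253"
    for rank, cores in enumerate(parse_omp_threads_bind(spec)):
        print(f"rank {rank}: {len(cores)} cores ({cores[0]}..{cores[-1]})")
```

With the binding string from the diff, the sketch yields six groups, which lines up with the `--tp 6` used in the launch examples.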
@@ -164,7 +168,7 @@ python -m sglang.bench_serving -h
 ```

 Additionally, the requests can be formed with
-[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
+[OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
 and sent via the command line (e.g. using `curl`) or via your own script.

 ## Example: Running DeepSeek-R1
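As a sketch of the "own script" option mentioned in the hunk above, the following builds an OpenAI-style completions request with the Python standard library only. The base URL `http://localhost:30000`, the `/v1/completions` path, the placeholder model name, and the sampling parameters are assumptions for illustration; adjust them to match how the server was launched. The function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:30000",
                             model="default", max_tokens=32):
    """Build (but do not send) a POST request for an OpenAI-style completion.

    Assumption: the server exposes /v1/completions at base_url.
    """
    payload = {
        "model": model,  # placeholder name, not taken from the source
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a running server at the assumed address.
    print(send_completion("The capital of France is"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.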
@@ -180,7 +184,6 @@ python -m sglang.launch_server \
 --quantization w8a8_int8 \
 --host 0.0.0.0 \
 --mem-fraction-static 0.8 \
---max-total-token 65536 \
 --tp 6
 ```

@@ -194,7 +197,6 @@ python -m sglang.launch_server \
 --device cpu \
 --host 0.0.0.0 \
 --mem-fraction-static 0.8 \
---max-total-token 65536 \
 --tp 6
 ```