[CPU] fix OOM when mem-fraction is not set (#9090)

Zaili Wang
2025-09-11 14:52:22 +08:00
committed by GitHub
parent 4aa1e69bc7
commit ef959d7b85
6 changed files with 29 additions and 16 deletions


@@ -84,13 +84,13 @@ git checkout <YOUR-DESIRED-VERSION>
# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"
pip install torch==2.7.1 torchvision==0.22.1 triton==3.3.1 --force-reinstall
# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
-pip install -v .
+pip install .
# Other required environment variables
# It is recommended to set these in ~/.bashrc so they do not need to be set in every new terminal
@@ -134,13 +134,17 @@ Notes:
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
Please be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set,
the available memory of each rank may not be determined in advance.
You may need to set a proper `--max-total-tokens` to avoid out-of-memory errors.
3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`.
To specify the maximum batch size when using torch compile, set the flag `--torch-compile-max-bs`.
For example, `--enable-torch-compile --torch-compile-max-bs 4` means using torch compile and setting the
maximum batch size to 4.
4. A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
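As a side note on the `SGLANG_CPU_OMP_THREADS_BIND` format shown in the notes above (per-rank core ranges separated by `|`), the string can be parsed as in this sketch; the helper name is hypothetical and not part of SGLang:

```python
def parse_threads_bind(spec: str) -> list[list[int]]:
    """Parse a SGLANG_CPU_OMP_THREADS_BIND-style string into per-rank core ID lists.

    Each '|'-separated field describes one tensor-parallel rank; a field may
    contain comma-separated core IDs or 'lo-hi' ranges (inclusive).
    """
    ranks = []
    for rank_spec in spec.split("|"):
        cores = []
        for part in rank_spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cores.extend(range(int(lo), int(hi) + 1))
            else:
                cores.append(int(part))
        ranks.append(cores)
    return ranks

binding = parse_threads_bind("0-39|43-82|86-125|128-167|171-210|214-253")
print(len(binding), len(binding[0]))  # 6 ranks, 40 cores bound to each
```

Uneven rank sizes produced by such a binding are exactly why the per-rank memory, and thus `--max-total-tokens`, may need manual tuning.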
## Benchmarking with Requests
@@ -164,7 +168,7 @@ python -m sglang.bench_serving -h
```
Additionally, the requests can be formed with
-[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
+[OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.
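For instance, a Completions request body can be built with nothing but the standard library and then sent to the running server; the model name, host, and port below are illustrative:

```python
import json

def build_completion_request(model: str, prompt: str, max_tokens: int = 16) -> str:
    """Build an OpenAI-compatible Completions API request body as a JSON string."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = build_completion_request("deepseek-ai/DeepSeek-R1", "The capital of France is")
print(body)
# POST the body to the server's /v1/completions endpoint, e.g.:
#   curl http://localhost:30000/v1/completions \
#        -H "Content-Type: application/json" -d "$body"
```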
## Example: Running DeepSeek-R1
@@ -180,7 +184,6 @@ python -m sglang.launch_server \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
-  --max-total-token 65536 \
--tp 6
```
@@ -194,7 +197,6 @@ python -m sglang.launch_server \
--device cpu \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
-  --max-total-token 65536 \
--tp 6
```
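The launch examples above set `--mem-fraction-static 0.8`, which bounds the memory pool from which model weights and the KV cache (and hence the servable token count) are carved. A rough sketch of that relationship, with a hypothetical helper and illustrative numbers, not SGLang's actual memory accounting:

```python
def estimate_max_total_tokens(total_mem_bytes: int,
                              mem_fraction_static: float,
                              weights_bytes: int,
                              kv_bytes_per_token: int) -> int:
    """Illustrative only: how many tokens fit in the static memory pool
    once the model weights are loaded (a simplification, not SGLang's
    real accounting)."""
    pool = int(total_mem_bytes * mem_fraction_static)
    return max(0, (pool - weights_bytes) // kv_bytes_per_token)

GIB = 1024 ** 3
# e.g. a 512 GiB node, 0.8 static fraction, 100 GiB of weights, 160 KiB KV per token
print(estimate_max_total_tokens(512 * GIB, 0.8, 100 * GIB, 160 * 1024))
```

When the fraction is left unset and the server's estimate is too high, the pool can exceed what is actually free, which is the kind of out-of-memory failure this commit addresses.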