Docs: add torch compile cache (#4151)

Co-authored-by: ybyang <ybyang7@iflytek.com>
Author: Chayenne
Date: 2025-03-06 14:27:09 -08:00 (committed by GitHub)
parent 19fd57bcd7
commit ebddb65aed
6 changed files with 48 additions and 32 deletions


@@ -4,4 +4,4 @@ Multi-Node Deployment
:maxdepth: 1
multi_node.md
k8s.md
deploy_on_k8s.md


@@ -61,8 +61,7 @@ If you encounter errors when starting the server, ensure the weights have finish
### Caching `torch.compile`
The DeepSeek series has huge model weights, so it takes some time to compile the model with `torch.compile` the first time you run it with the flag `--enable-torch-compile`. By default, `torch.compile` automatically caches the FX graph and Triton artifacts in `/tmp/torchinductor_root`, which might be cleared according to the [system policy](https://serverfault.com/questions/377348/when-does-tmp-get-cleared). You can export the environment variable `TORCHINDUCTOR_CACHE_DIR` to save the compilation cache in a directory of your choice and avoid unwanted deletion. You can also share the cache with other machines to reduce compilation time. See the [PyTorch official documentation](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) and the [SGLang documentation](./torch_compile_cache.md) for more details.
The DeepSeek series has huge model weights, so it takes some time to compile the model with `torch.compile` the first time you run it with the flag `--enable-torch-compile`. Refer to [this guide](https://docs.sglang.ai/backend/hyperparameter_tuning.html#try-advanced-options) to cache the compilation results so that the cache can speed up subsequent startups.
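For example, a persistent cache directory can be pinned before launching the server (a minimal sketch; the directory path and tensor-parallel size are illustrative, not from the original docs):

```bash
# Keep TorchInductor artifacts out of /tmp so routine tmp cleanup cannot evict them
export TORCHINDUCTOR_CACHE_DIR=/data/torchinductor_cache

# The first launch compiles and populates the cache; later launches reuse it
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 \
    --tp 8 --trust-remote-code --enable-torch-compile
```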
### Launch with one node of 8 H200
Please refer to [the example](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended). **Note that DeepSeek V3 is already in FP8, so it should not be run with quantization arguments like `--quantization fp8 --kv-cache-dtype fp8_e5m2`.** Also, `--enable-dp-attention` can be useful for improving DeepSeek V3/R1's throughput. Please refer to [Data Parallelism Attention](https://docs.sglang.ai/references/deepseek.html#multi-head-latent-attention-mla-throughput-optimizations) for details.
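A hedged sketch of such a launch, combining the flags discussed above (no quantization arguments, since the checkpoint is already FP8; the tensor-parallel size is illustrative):

```bash
# DeepSeek V3 ships in FP8, so no --quantization / --kv-cache-dtype flags are passed
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 \
    --tp 8 --trust-remote-code --enable-dp-attention
```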


@@ -1,4 +1,4 @@
# Kubernetes
# Deploy On Kubernetes
This doc describes how to deploy a RoCE-network-based SGLang two-node inference service on a Kubernetes (K8s) cluster.


@@ -1,13 +0,0 @@
# Enabling cache for torch.compile
SGLang uses the `max-autotune-no-cudagraphs` mode of `torch.compile`, and the auto-tuning can be slow.
If you want to deploy a model on many different machines, you can ship the `torch.compile` cache to them and skip the compilation steps.
This approach is based on the [PyTorch compile caching tutorial](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html).
1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
```bash
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
```
2. Copy the cache folder to the other machines and launch the server with the same `TORCHINDUCTOR_CACHE_DIR`, as sketched below.
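A minimal sketch of step 2, reusing the cache directory generated in step 1 (the remote hostname is hypothetical):

```bash
# Ship the warmed compilation cache to another machine
rsync -a /root/inductor_root_cache/ other-host:/root/inductor_root_cache/

# On that machine, point TORCHINDUCTOR_CACHE_DIR at the shipped cache so
# torch.compile reuses the cached FX graphs and Triton kernels instead of recompiling
TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
```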