diff --git a/README.md b/README.md
index d82eff297..d36464681 100644
--- a/README.md
+++ b/README.md
@@ -46,14 +46,15 @@

 ## About

-SGLang is a fast serving framework for large language models and vision language models.
-It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
-The core features include:
+SGLang is a high-performance serving framework for large language models and vision-language models.
+It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
+Its core features include:

-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
-- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
-- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
-- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
+- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models. Compatible with most Hugging Face models and OpenAI APIs.
+- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
+- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, supporting chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
+- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 300,000 GPUs worldwide.

 ## Getting Started
 - [Install SGLang](https://docs.sglang.ai/get_started/install.html)
@@ -69,7 +70,8 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 [Development Roadmap (2025 H2)](https://github.com/sgl-project/sglang/issues/7736)

 ## Adoption and Sponsorship
-SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 1,000,000 GPUs worldwide.
+SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 300,000 GPUs worldwide.
+SGLang is currently hosted under the non-profit open-source organization [LMSYS](https://lmsys.org/about/).

 logo
diff --git a/docs/get_started/install.md b/docs/get_started/install.md
index ca55933f5..ae314b97a 100644
--- a/docs/get_started/install.md
+++ b/docs/get_started/install.md
@@ -12,7 +12,7 @@ It is recommended to use uv for faster installation:
 ```bash
 pip install --upgrade pip
 pip install uv
-uv pip install sglang --upgrade
+uv pip install sglang --prerelease=allow
 ```

 **Quick fixes to common problems**
@@ -129,5 +129,3 @@ sky status --endpoint 30000 sglang

 - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
 - To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
-- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
-- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. `srt` is the abbreviation of SGLang runtime.
diff --git a/docs/index.rst b/docs/index.rst
index 480612bb9..f34f13e37 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,14 +1,15 @@
 SGLang Documentation
 ====================

-SGLang is a fast serving framework for large language models and vision language models.
-It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
-The core features include:
+SGLang is a high-performance serving framework for large language models and vision-language models.
+It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.
+Its core features include:

-- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
-- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
-- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
-- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
+- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), and reward models (Skywork), with easy extensibility for integrating new models. Compatible with most Hugging Face models and OpenAI APIs.
+- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
+- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, supporting chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
+- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 300,000 GPUs worldwide.

 .. toctree::
    :maxdepth: 1
diff --git a/docs/platforms/tpu.md b/docs/platforms/tpu.md
index f304234cf..c9d32f2dd 100644
--- a/docs/platforms/tpu.md
+++ b/docs/platforms/tpu.md
@@ -1,3 +1,3 @@
 # TPU

-The support for TPU is under active development. Please stay tuned.
+TPU support is provided through the SGLang-JAX project: https://github.com/sgl-project/sglang-jax
diff --git a/python/pyproject.toml b/python/pyproject.toml
index c77845f35..804b4291c 100755
--- a/python/pyproject.toml
+++ b/python/pyproject.toml
@@ -13,6 +13,7 @@ classifiers = [
     "Programming Language :: Python :: 3",
     "License :: OSI Approved :: Apache Software License",
 ]
+
 dependencies = [
     "IPython",
     "aiohttp",
@@ -21,6 +22,7 @@ dependencies = [
     "build",
     "compressed-tensors",
     "cuda-python",
+    "decord2",
     "datasets",
     "einops",
     "fastapi",
@@ -73,7 +75,12 @@ dependencies = [
 ]

 [project.optional-dependencies]
-decord = ["decord2"]
+tracing = [
+    "opentelemetry-api",
+    "opentelemetry-exporter-otlp",
+    "opentelemetry-exporter-otlp-proto-grpc",
+    "opentelemetry-sdk",
+]
 test = [
     "accelerate",
     "expecttest",
@@ -86,13 +93,10 @@ test = [
     "sentence_transformers",
     "tabulate",
 ]
-tracing = [
-    "opentelemetry-api",
-    "opentelemetry-exporter-otlp",
-    "opentelemetry-exporter-otlp-proto-grpc",
-    "opentelemetry-sdk",
-]
-all = ["sglang[test]", "sglang[decord]"]
+all = []
+dev = ["sglang[test]"]
+
+# Temporary tags
 cu130 = [
     "torch==2.9.0",
     "torchaudio==2.9.0",
@@ -104,13 +108,9 @@
 cu130_all = [
     "sglang[cu130]"
 ]
-
-# The following will be deprecated in 2 weeks
-dev = ["sglang[test]", "sglang[decord]"]
-all_aarch64 = ["sglang[test]"]
-blackwell = ["sglang[test]", "sglang[decord]"]
-blackwell_aarch64 = ["sglang[test]"]
-
+# To be deprecated in 2 weeks
+blackwell = ["sglang[dev]"]
+blackwell_aarch64 = ["sglang[dev]"]

 [project.urls]
 "Homepage" = "https://github.com/sgl-project/sglang"
diff --git a/python/pyproject_cpu.toml b/python/pyproject_cpu.toml
index 0fe73680c..224e79680 100644
--- a/python/pyproject_cpu.toml
+++ b/python/pyproject_cpu.toml
@@ -5,85 +5,88 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "sglang"
-version = "0.5.3rc0"
+version = "0.5.3.post3"
 description = "SGLang is a fast serving framework for large language models and vision language models."
readme = "README.md" requires-python = ">=3.10" license = { file = "LICENSE" } classifiers = [ - "Programming Language :: Python :: 3", - "License :: OSI Approved :: Apache Software License", + "Programming Language :: Python :: 3", + "License :: OSI Approved :: Apache Software License", ] dependencies = [ - "aiohttp", - "anthropic>=0.20.0", - "blobfile==3.0.0", - "build", - "compressed-tensors", - "datasets", - "decord", - "einops", - "fastapi", - "hf_transfer", - "huggingface_hub", - "intel-openmp", - "interegular", - "IPython", - "llguidance>=0.7.11,<0.8.0", - "modelscope", - "msgspec", - "ninja", - "numpy", - "openai==1.99.1", - "openai-harmony==0.0.4", - "orjson", - "outlines==0.1.11", - "packaging", - "partial_json_parser", - "pillow", - "prometheus-client>=0.20.0", - "psutil", - "pybase64", - "pydantic", - "python-multipart", - "pyzmq>=25.1.2", - "requests", - "scipy", - "sentencepiece", - "setproctitle", - "soundfile==0.13.1", - "tiktoken", - "timm==1.0.16", - "torchao==0.9.0", - "tqdm", - "transformers==4.57.1", - "uvicorn", - "uvloop", - "xgrammar==0.1.25", + "IPython", + "aiohttp", + "anthropic>=0.20.0", + "blobfile==3.0.0", + "build", + "compressed-tensors", + "datasets", + "decord", + "einops", + "fastapi", + "hf_transfer", + "huggingface_hub", + "intel-openmp", + "interegular", + "llguidance>=0.7.11,<0.8.0", + "modelscope", + "msgspec", + "ninja", + "numpy", + "openai-harmony==0.0.4", + "openai==1.99.1", + "orjson", + "outlines==0.1.11", + "packaging", + "partial_json_parser", + "pillow", + "prometheus-client>=0.20.0", + "psutil", + "py-spy", + "pybase64", + "pydantic", + "python-multipart", + "pyzmq>=25.1.2", + "requests", + "scipy", + "sentencepiece", + "setproctitle", + "soundfile==0.13.1", + "tiktoken", + "timm==1.0.16", + "torchao==0.9.0", + "tqdm", + "transformers==4.57.1", + "uvicorn", + "uvloop", + "xgrammar==0.1.25", + "grpcio==1.75.1", # keep it align with compile_proto.py + "grpcio-tools==1.75.1", # keep it align with compile_proto.py + "grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py ] [project.optional-dependencies] tracing = [ - "opentelemetry-sdk", - "opentelemetry-api", - "opentelemetry-exporter-otlp", - "opentelemetry-exporter-otlp-proto-grpc", + "opentelemetry-sdk", + "opentelemetry-api", + "opentelemetry-exporter-otlp", + "opentelemetry-exporter-otlp-proto-grpc", ] - test = [ - "accelerate", - "expecttest", - "jsonlines", - "matplotlib", - "pandas", - "peft", - "sentence_transformers", - "pytest", - "tabulate", + "accelerate", + "expecttest", + "jsonlines", + "matplotlib", + "pandas", + "peft", + "pytest", + "sentence_transformers", + "tabulate", ] - -dev = ["sglang", "sglang[test]"] +all = [] +dev = ["sglang[test]"] [project.urls] "Homepage" = "https://github.com/sgl-project/sglang" @@ -91,31 +94,33 @@ dev = ["sglang", "sglang[test]"] [tool.setuptools.package-data] "sglang" = [ - "srt/layers/moe/fused_moe_triton/configs/*/*.json", - "srt/layers/quantization/configs/*.json", - "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/layers/moe/fused_moe_triton/configs/*/*.json", + "srt/layers/quantization/configs/*.json", + "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/speculative/cpp_ngram/*.cpp", + "srt/speculative/cpp_ngram/*.h", ] [tool.setuptools.packages.find] exclude = [ - "assets*", - "benchmark*", - "docs*", - "dist*", - "playground*", - "scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.wheel] exclude = [ - "assets*", - "benchmark*", - "docs*", 
- "dist*", - "playground*", - "scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.codespell] diff --git a/python/pyproject_other.toml b/python/pyproject_other.toml index 69172103f..3fd0bee57 100755 --- a/python/pyproject_other.toml +++ b/python/pyproject_other.toml @@ -10,76 +10,77 @@ readme = "README.md" requires-python = ">=3.10" license = { file = "LICENSE" } classifiers = [ - "Programming Language :: Python :: 3", - "License :: OSI Approved :: Apache Software License", + "Programming Language :: Python :: 3", + "License :: OSI Approved :: Apache Software License", ] dependencies = ["aiohttp", "requests", "tqdm", "numpy", "IPython", "setproctitle"] [project.optional-dependencies] runtime_common = [ - "blobfile==3.0.0", - "build", - "compressed-tensors", - "datasets", - "einops", - "fastapi", - "hf_transfer", - "huggingface_hub", - "interegular", - "llguidance>=0.7.11,<0.8.0", - "modelscope", - "msgspec", - "ninja", - "openai==1.99.1", - "openai-harmony==0.0.4", - "orjson", - "outlines==0.1.11", - "packaging", - "partial_json_parser", - "pillow", - "prometheus-client>=0.20.0", - "psutil", - "pybase64", - "pydantic", - "pynvml", - "python-multipart", - "pyzmq>=25.1.2", - "scipy", - "sentencepiece", - "soundfile==0.13.1", - "timm==1.0.16", - "tiktoken", - "torchao==0.9.0", - "transformers==4.57.1", - "uvicorn", - "uvloop", - "xgrammar==0.1.25", + "IPython", + "aiohttp", + "anthropic>=0.20.0", + "blobfile==3.0.0", + "build", + "compressed-tensors", + "decord2", + "datasets", + "einops", + "fastapi", + "hf_transfer", + "huggingface_hub", + "interegular", + "llguidance>=0.7.11,<0.8.0", + "modelscope", + "msgspec", + "ninja", + "numpy", + "openai-harmony==0.0.4", + "openai==1.99.1", + "orjson", + "outlines==0.1.11", + "packaging", + "partial_json_parser", + "pillow", + "prometheus-client>=0.20.0", + "psutil", + "py-spy", + "pybase64", + "pydantic", + "python-multipart", + "pyzmq>=25.1.2", + "requests", + "scipy", + "sentencepiece", + "setproctitle", + "soundfile==0.13.1", + "tiktoken", + "timm==1.0.16", + "torchao==0.9.0", + "tqdm", + "transformers==4.57.1", + "uvicorn", + "uvloop", + "xgrammar==0.1.25", + "grpcio==1.75.1", # keep it align with compile_proto.py + "grpcio-tools==1.75.1", # keep it align with compile_proto.py + "grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py ] tracing = [ - "opentelemetry-sdk", - "opentelemetry-api", - "opentelemetry-exporter-otlp", - "opentelemetry-exporter-otlp-proto-grpc", -] - -srt = [ - "sglang[runtime_common]", - "sgl-kernel==0.3.15", - "torch==2.8.0", - "torchaudio==2.8.0", - "torchvision", - "cuda-python", - "flashinfer_python==0.4.0", + "opentelemetry-sdk", + "opentelemetry-api", + "opentelemetry-exporter-otlp", + "opentelemetry-exporter-otlp-proto-grpc", ] # HIP (Heterogeneous-computing Interface for Portability) for AMD # => base docker rocm/vllm-dev:20250114, not from public vllm whl srt_hip = [ - "sglang[runtime_common]", - "torch", - "petit_kernel==0.0.2", - "wave-lang==3.7.0", + "sglang[runtime_common]", + "torch", + "petit_kernel==0.0.2", + "wave-lang==3.7.0", ] # https://docs.sglang.ai/platforms/ascend_npu.html @@ -89,29 +90,24 @@ srt_npu = ["sglang[runtime_common]"] # https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html srt_hpu = ["sglang[runtime_common]"] -openai = ["openai==1.99.1", "tiktoken"] -anthropic = ["anthropic>=0.20.0"] -litellm = ["litellm>=1.0.0"] -torch_memory_saver = ["torch_memory_saver==0.0.9rc1"] -decord = 
["decord"] test = [ - "accelerate", - "expecttest", - "jsonlines", - "matplotlib", - "pandas", - "peft", - "sentence_transformers", - "pytest", - "tabulate", + "accelerate", + "expecttest", + "gguf", + "jsonlines", + "matplotlib", + "pandas", + "peft", + "pytest", + "sentence_transformers", + "tabulate", ] -all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[torch_memory_saver]", "sglang[decord]"] -all_hip = ["sglang[srt_hip]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"] -all_hpu = ["sglang[srt_hpu]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"] -all_npu = ["sglang[srt_npu]", "sglang[openai]", "sglang[anthropic]", "sglang[decord]"] +all_hip = ["sglang[srt_hip]"] +all_npu = ["sglang[srt_npu]"] +all_hpu = ["sglang[srt_hpu]"] -dev = ["sglang[all]", "sglang[test]"] dev_hip = ["sglang[all_hip]", "sglang[test]"] +dev_npu = ["sglang[all_npu]", "sglang[test]"] dev_hpu = ["sglang[all_hpu]", "sglang[test]"] [project.urls] @@ -120,31 +116,33 @@ dev_hpu = ["sglang[all_hpu]", "sglang[test]"] [tool.setuptools.package-data] "sglang" = [ - "srt/layers/moe/fused_moe_triton/configs/*/*.json", - "srt/layers/quantization/configs/*.json", - "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/layers/moe/fused_moe_triton/configs/*/*.json", + "srt/layers/quantization/configs/*.json", + "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/speculative/cpp_ngram/*.cpp", + "srt/speculative/cpp_ngram/*.h", ] [tool.setuptools.packages.find] exclude = [ - "assets*", - "benchmark*", - "docs*", - "dist*", - "playground*", - "scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.wheel] exclude = [ - "assets*", - "benchmark*", - "docs*", - "dist*", - "playground*", - "scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.codespell] diff --git a/python/pyproject_xpu.toml b/python/pyproject_xpu.toml index 0a83e021b..26557be88 100644 --- a/python/pyproject_xpu.toml +++ b/python/pyproject_xpu.toml @@ -6,84 +6,87 @@ build-backend = "setuptools.build_meta" [project] name = "sglang" -version = "0.5.3rc0" +version = "0.5.3.post3" description = "SGLang is a fast serving framework for large language models and vision language models." 
readme = "README.md" requires-python = ">=3.10" license = { file = "LICENSE" } classifiers = [ - "Programming Language :: Python :: 3", - "License :: OSI Approved :: Apache Software License", + "Programming Language :: Python :: 3", + "License :: OSI Approved :: Apache Software License", ] dependencies = [ - "aiohttp", - "anthropic>=0.20.0", - "blobfile==3.0.0", - "build", - "compressed-tensors", - "datasets", - "decord", - "einops", - "fastapi", - "hf_transfer", - "huggingface_hub", - "interegular", - "IPython", - "llguidance>=0.7.11,<0.8.0", - "modelscope", - "msgspec", - "ninja", - "numpy", - "openai==1.99.1", - "openai-harmony==0.0.4", - "orjson", - "outlines==0.1.11", - "packaging", - "partial_json_parser", - "pillow", - "prometheus-client>=0.20.0", - "psutil", - "pybase64", - "pydantic", - "python-multipart", - "pyzmq>=25.1.2", - "requests", - "scipy", - "sentencepiece", - "setproctitle", - "soundfile==0.13.1", - "tiktoken", - "timm==1.0.16", - "torchao==0.9.0", - "tqdm", - "transformers==4.57.1", - "uvicorn", - "uvloop", - "xgrammar==0.1.25", + "IPython", + "aiohttp", + "anthropic>=0.20.0", + "blobfile==3.0.0", + "build", + "compressed-tensors", + "datasets", + "decord", + "einops", + "fastapi", + "hf_transfer", + "huggingface_hub", + "interegular", + "llguidance>=0.7.11,<0.8.0", + "modelscope", + "msgspec", + "ninja", + "numpy", + "openai-harmony==0.0.4", + "openai==1.99.1", + "orjson", + "outlines==0.1.11", + "packaging", + "partial_json_parser", + "pillow", + "prometheus-client>=0.20.0", + "psutil", + "py-spy", + "pybase64", + "pydantic", + "python-multipart", + "pyzmq>=25.1.2", + "requests", + "scipy", + "sentencepiece", + "setproctitle", + "soundfile==0.13.1", + "tiktoken", + "timm==1.0.16", + "torchao==0.9.0", + "tqdm", + "transformers==4.57.1", + "uvicorn", + "uvloop", + "xgrammar==0.1.25", + "grpcio==1.75.1", # keep it align with compile_proto.py + "grpcio-tools==1.75.1", # keep it align with compile_proto.py + "grpcio-reflection==1.75.1", # required by srt/entrypoints/grpc_server.py ] [project.optional-dependencies] tracing = [ - "opentelemetry-sdk", - "opentelemetry-api", - "opentelemetry-exporter-otlp", - "opentelemetry-exporter-otlp-proto-grpc", + "opentelemetry-sdk", + "opentelemetry-api", + "opentelemetry-exporter-otlp", + "opentelemetry-exporter-otlp-proto-grpc", ] - test = [ - "accelerate", - "expecttest", - "jsonlines", - "matplotlib", - "pandas", - "peft", - "sentence_transformers", - "pytest", - "tabulate", + "accelerate", + "expecttest", + "jsonlines", + "matplotlib", + "pandas", + "peft", + "pytest", + "sentence_transformers", + "tabulate", ] - -dev = ["sglang", "sglang[test]"] +all = [] +dev = ["sglang[test]"] [project.urls] "Homepage" = "https://github.com/sgl-project/sglang" @@ -91,31 +94,33 @@ dev = ["sglang", "sglang[test]"] [tool.setuptools.package-data] "sglang" = [ - "srt/layers/moe/fused_moe_triton/configs/*/*.json", - "srt/layers/quantization/configs/*.json", - "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/layers/moe/fused_moe_triton/configs/*/*.json", + "srt/layers/quantization/configs/*.json", + "srt/mem_cache/storage/hf3fs/hf3fs_utils.cpp", + "srt/speculative/cpp_ngram/*.cpp", + "srt/speculative/cpp_ngram/*.h", ] [tool.setuptools.packages.find] exclude = [ - "assets*", - "benchmark*", - "docs*", - "dist*", - "playground*", - "scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.wheel] exclude = [ - "assets*", - "benchmark*", - "docs*", - "dist*", - "playground*", - 
"scripts*", - "tests*", + "assets*", + "benchmark*", + "docs*", + "dist*", + "playground*", + "scripts*", + "tests*", ] [tool.codespell] diff --git a/python/sglang/srt/model_executor/model_runner.py b/python/sglang/srt/model_executor/model_runner.py index 23faf600b..6f8bc40ee 100644 --- a/python/sglang/srt/model_executor/model_runner.py +++ b/python/sglang/srt/model_executor/model_runner.py @@ -623,7 +623,7 @@ class ModelRunner: server_args.disable_chunked_prefix_cache = True if not server_args.disable_chunked_prefix_cache: - logger.info("Chunked prefix cache is turned on.") + log_info_on_rank0(logger, "Chunked prefix cache is turned on.") if server_args.attention_backend == "aiter": if self.model_config.context_len > 8192: diff --git a/python/sglang/srt/server_args.py b/python/sglang/srt/server_args.py index e363dbf7d..71113c5e8 100644 --- a/python/sglang/srt/server_args.py +++ b/python/sglang/srt/server_args.py @@ -253,7 +253,6 @@ class ServerArgs: log_requests: bool = False log_requests_level: int = 2 crash_dump_folder: Optional[str] = None - crash_on_nan: bool = False show_time_cost: bool = False enable_metrics: bool = False enable_metrics_for_all_schedulers: bool = False @@ -1899,12 +1898,6 @@ class ServerArgs: default=ServerArgs.crash_dump_folder, help="Folder path to dump requests from the last 5 min before a crash (if any). If not specified, crash dumping is disabled.", ) - parser.add_argument( - "--crash-on-nan", - type=str, - default=ServerArgs.crash_on_nan, - help="Crash the server on nan logprobs.", - ) parser.add_argument( "--show-time-cost", action="store_true",