From 455bfe8dd35f70fe0e34fd9971397be436a85f58 Mon Sep 17 00:00:00 2001
From: Liangjun Song
Date: Mon, 3 Feb 2025 15:29:10 +1100
Subject: [PATCH] Add a Doc about guide on nvidia jetson #3182 (#3205)

Co-authored-by: Shi Shuai <126407087+shuaills@users.noreply.github.com>
Co-authored-by: zhaochenyang20
---
 docs/index.rst                   |  1 +
 docs/references/nvidia_jetson.md | 67 ++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)
 create mode 100644 docs/references/nvidia_jetson.md

diff --git a/docs/index.rst b/docs/index.rst
index b8067c25d..f6f14725f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -64,5 +64,6 @@ The core features include:
    references/modelscope.md
    references/contribution_guide.md
    references/troubleshooting.md
+   references/nvidia_jetson.md
    references/faq.md
    references/learn_more.md
diff --git a/docs/references/nvidia_jetson.md b/docs/references/nvidia_jetson.md
new file mode 100644
index 000000000..a36a42ba4
--- /dev/null
+++ b/docs/references/nvidia_jetson.md
@@ -0,0 +1,67 @@
+# Apply SGLang on NVIDIA Jetson Orin
+
+## Prerequisites
+
+Before starting, ensure the following:
+
+- The [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
+- The **CUDA Toolkit** and **cuDNN** are installed.
+- Verify that the Jetson AGX Orin is in **high-performance mode**:
+  ```bash
+  sudo nvpmodel -m 0
+  ```
+- Use the custom PyPI index hosted at https://pypi.jetson-ai-lab.dev/jp6/cu126, which provides wheels built for NVIDIA Jetson Orin platforms and CUDA 12.6.
+
+To install torch from this index:
+```bash
+pip install torch --index-url https://pypi.jetson-ai-lab.dev/jp6/cu126
+```
+
+* * * * *
+
+## Installation
+
+Please refer to the [Installation Guide](https://docs.sglang.ai/start/install.html) to install FlashInfer and SGLang.
+
+* * * * *
+
+## Running Inference
+
+Launch the server:
+
+```bash
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+    --device cuda \
+    --dtype half \
+    --attention-backend flashinfer \
+    --mem-fraction-static 0.8 \
+    --context-length 8192
+```
+
+The reduced precision and limited context length (`--dtype half --context-length 8192`) accommodate the limited memory and compute of the [NVIDIA Jetson devkit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation of these flags can be found in [Server Arguments](https://docs.sglang.ai/backend/server_arguments.html).
+
+After launching the server, refer to [Chat Completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test its usability.
+
+* * * * *
+
+## Running quantization with TorchAO
+
+TorchAO quantization is recommended on NVIDIA Jetson Orin:
+
+```bash
+python -m sglang.launch_server \
+    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+    --device cuda \
+    --dtype bfloat16 \
+    --attention-backend flashinfer \
+    --mem-fraction-static 0.8 \
+    --context-length 8192 \
+    --torchao-config int4wo-128
+```
+
+This enables TorchAO's int4 weight-only quantization with a group size of 128, which further reduces memory usage.
+
+* * * * *
+
+## Structured output with XGrammar
+
+Please refer to the [Structured Outputs documentation](https://docs.sglang.ai/backend/structured_outputs.html).
+
+* * * * *
+
+Thanks to [shahizat](https://github.com/shahizat) for the support.
+
+## References
+
+- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)
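
A note on the Prerequisites section of the patch above: after installing torch from the Jetson wheel index, it is worth confirming that the wheel was built with CUDA support before launching SGLang. The snippet below is a generic sanity-check sketch (not from the SGLang docs) that degrades gracefully when torch or the GPU is unavailable:

```python
def cuda_status() -> str:
    """Report whether the installed torch wheel can see the Orin GPU."""
    try:
        # Expected to come from the pypi.jetson-ai-lab.dev/jp6/cu126 index.
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # On a correctly set-up devkit this should name the Orin GPU.
        return f"torch {torch.__version__} sees {torch.cuda.get_device_name(0)}"
    return "torch is installed but CUDA is unavailable"


print(cuda_status())
```

If this reports that CUDA is unavailable, the torch wheel likely came from the default PyPI index rather than the Jetson one; reinstall with the `--index-url` shown in the Prerequisites.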
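
As a usage sketch to accompany the "Running Inference" section above: once the server is up, it exposes an OpenAI-compatible HTTP API that can be queried with only the Python standard library. The base URL below assumes the server's default port (30000); the model name follows the launch command in the doc. This is a minimal illustration, not part of the SGLang documentation:

```python
import json
from urllib import request


def build_chat_payload(prompt: str,
                       model: str = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
                       max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST the prompt to the server started by sglang.launch_server."""
    req = request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the text under choices[0].message.
    return body["choices"][0]["message"]["content"]


# With the server running on the devkit:
#   chat("What is the capital of France?")
```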