# NVIDIA Jetson Orin
## Prerequisites
Before starting, ensure the following:
- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:
```bash
sudo nvpmodel -m 0
```
* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers GitHub repository:
```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```bash
bash jetson-containers/install.sh
```
Build the container:
```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```
Run the container:
```bash
# Replace IMAGE_NAME with the image tag printed by the build step
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *
## Running Inference
Launch the server:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
```
The half-precision dtype and limited context length (`--dtype half --context-length 8192`) are chosen to fit the limited memory and compute of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation of these flags can be found in [Server Arguments](../backend/server_arguments.md).
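To see why these limits matter, a rough back-of-the-envelope memory estimate helps. The sketch below assumes the published Llama-3.1-8B architecture used by the distill model (32 layers, 8 KV heads, head dimension 128); the numbers are illustrative, not measured on device.

```python
# Rough memory estimate for an 8B Llama-style model at half precision.
# Architecture constants are the published Llama-3.1-8B config values.
NUM_PARAMS = 8.0e9        # ~8B parameters
BYTES_PER_PARAM = 2       # fp16/half
NUM_LAYERS = 32
NUM_KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
KV_BYTES = 2              # fp16 cache entries
CONTEXT_LEN = 8192

weights_gib = NUM_PARAMS * BYTES_PER_PARAM / 2**30

# KV cache per token: K and V tensors, per layer, per KV head, per head dim.
kv_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES
kv_cache_gib = kv_per_token * CONTEXT_LEN / 2**30

print(f"weights:  ~{weights_gib:.1f} GiB")                       # ~14.9 GiB
print(f"KV cache: ~{kv_cache_gib:.1f} GiB at {CONTEXT_LEN} tokens")  # ~1.0 GiB
```

Together that is roughly 16 GiB before activations and CUDA overhead, which is why a longer context or a wider dtype may not fit in the Orin's shared CPU/GPU memory.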
After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to verify that it is working.
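As a quick smoke test, you can send an OpenAI-compatible chat request to the server. The sketch below only builds and prints the request body; the address (port 30000 is SGLang's default) and the commented `requests.post` call are assumptions you may need to adjust for your setup.

```python
import json

# Assumed address: SGLang listens on port 30000 by default.
URL = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

# With the server running, send the request, e.g. with the requests library:
# import requests
# resp = requests.post(URL, json=payload, timeout=60)
# print(resp.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```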
* * * * *
## Running quantization with TorchAO
TorchAO quantization is recommended on NVIDIA Jetson Orin to reduce memory usage.
```bash
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128, which substantially reduces the memory footprint of the model weights.
* * * * *
## Structured output with XGrammar
Please refer to the [SGLang structured outputs documentation](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks to the support from [shahizat](https://github.com/shahizat).
## References
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)