# NVIDIA Jetson Orin
## Prerequisites
Before starting, ensure the following:
- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:
```bash
sudo nvpmodel -m 0
```
* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers GitHub repository:
```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```bash
bash jetson-containers/install.sh
```
Build the container:
```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```
Run the container:
```bash
# Replace IMAGE_NAME with the image tag printed by the build step
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *
## Running Inference
Launch the server:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
```
The half-precision dtype and limited context length (`--dtype half --context-length 8192`) are chosen to fit the limited memory and compute of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation of these flags can be found in [Server Arguments](../backend/server_arguments.md).
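To see why these limits matter, a rough back-of-the-envelope memory estimate helps. The sketch below assumes the published Llama-3.1-8B architecture used by the distill model (32 layers, 8 KV heads, head dimension 128); the numbers are illustrative, not measured on device.

```python
# Rough memory estimate for an 8B Llama-style model at half precision.
# Architecture constants are the published Llama-3.1-8B config values.
NUM_PARAMS = 8.0e9        # ~8B parameters
BYTES_PER_PARAM = 2       # fp16/half
NUM_LAYERS = 32
NUM_KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
KV_BYTES = 2              # fp16 cache entries
CONTEXT_LEN = 8192

weights_gib = NUM_PARAMS * BYTES_PER_PARAM / 2**30

# KV cache per token: K and V tensors, per layer, per KV head, per head dim.
kv_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES
kv_cache_gib = kv_per_token * CONTEXT_LEN / 2**30

print(f"weights:  ~{weights_gib:.1f} GiB")                       # ~14.9 GiB
print(f"KV cache: ~{kv_cache_gib:.1f} GiB at {CONTEXT_LEN} tokens")  # ~1.0 GiB
```

Together that is roughly 16 GiB before activations and CUDA overhead, which is why a longer context or a wider dtype may not fit in the Orin's shared CPU/GPU memory.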
After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to verify that it is working.
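As a quick smoke test, you can send an OpenAI-compatible chat request to the server. The sketch below only builds and prints the request body; the address (port 30000 is SGLang's default) and the commented `requests.post` call are assumptions you may need to adjust for your setup.

```python
import json

# Assumed address: SGLang listens on port 30000 by default.
URL = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

# With the server running, send the request, e.g. with the requests library:
# import requests
# resp = requests.post(URL, json=payload, timeout=60)
# print(resp.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```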
* * * * *
## Running quantization with TorchAO
TorchAO quantization is recommended on NVIDIA Jetson Orin to reduce memory usage.
```bash
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128, which substantially reduces the memory footprint of the model weights.
* * * * *
## Structured output with XGrammar
Please refer to the [SGLang structured outputs documentation](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks to the support from [shahizat](https://github.com/shahizat).
## References
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)