# Apply SGLang on NVIDIA Jetson Orin

## Prerequisites
Before starting, ensure the following:
- NVIDIA Jetson AGX Orin Devkit is set up with JetPack 6.1 or later.
- CUDA Toolkit and cuDNN are installed.
- Verify that the Jetson AGX Orin is in high-performance mode:

  ```shell
  sudo nvpmodel -m 0
  ```

- A custom PyPI index is hosted at https://pypi.jetson-ai-lab.dev/jp6/cu126, tailored for NVIDIA Jetson Orin platforms and CUDA 12.6.

To install torch from this index:

```shell
pip install torch --index-url https://pypi.jetson-ai-lab.dev/jp6/cu126
```
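After installation, you can sanity-check that the wheel was built with CUDA support. This is a quick sketch using standard PyTorch calls; it assumes torch installed successfully from the index above:

```shell
# On a correctly set-up Orin, this should print the torch version and "True"
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```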
## Installation

Please refer to the Installation Guide to install FlashInfer and SGLang.
## Running Inference

Launch the server:

```shell
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --dtype half \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192
```
The reduced precision and limited context length (`--dtype half --context-length 8192`) are necessary because of the limited memory and compute on the NVIDIA Jetson kit. A detailed explanation can be found in Server Arguments.
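To get a rough sense of why the context must be capped, here is back-of-envelope KV-cache arithmetic for a Llama-3.1-8B-class model. The layer and head counts below are assumed architecture values for illustration, not numbers reported by SGLang:

```shell
# Assumed shape: 32 layers, 8 KV heads (GQA), head dim 128, fp16 cache (2 bytes/value).
LAYERS=32
KV_HEADS=8
HEAD_DIM=128
BYTES_PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * 2))   # K and V per token
echo "KV cache: ${BYTES_PER_TOKEN} bytes/token"              # 131072 bytes = 128 KiB
echo "Full 8192-token context: $((BYTES_PER_TOKEN * 8192 / 1024 / 1024)) MiB"  # 1024 MiB
```

So a single full-length sequence already claims about 1 GiB of cache on top of the model weights, which is why both the context length and `--mem-fraction-static` are kept conservative on this hardware.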
After launching the server, refer to Chat Completions to test usability.
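For example, once the server is up you can send a request to the OpenAI-compatible endpoint. The port below assumes SGLang's default of 30000; adjust it if you passed `--port`:

```shell
# Simple smoke test against the running server (requires the server launched above)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64
  }'
```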
## Running quantization with TorchAO

TorchAO quantization is recommended on NVIDIA Jetson Orin:
```shell
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --device cuda \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192 \
  --torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128. Like the flags above, `--torchao-config int4wo-128` is chosen for memory efficiency.
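The memory saving can be estimated with simple arithmetic. The figures below are illustrative: bf16 stores 2 bytes per parameter, while int4 weight-only packs two parameters per byte plus roughly one fp16 scale per 128-weight group; zero-points and runtime overheads (KV cache, activations) are ignored:

```shell
# Rough weight-footprint comparison for an ~8B-parameter model (illustrative numbers)
PARAMS=8000000000
BF16_GIB=$((PARAMS * 2 / 1024 / 1024 / 1024))          # ~14 GiB
INT4_BYTES=$((PARAMS / 2 + (PARAMS / 128) * 2))        # packed weights + group scales
INT4_GIB=$((INT4_BYTES / 1024 / 1024 / 1024))          # ~3 GiB
echo "bf16 weights: ~${BF16_GIB} GiB; int4wo-128 weights: ~${INT4_GIB} GiB"
```

On a device with unified memory shared between CPU and GPU, shrinking the weights from roughly 15 GiB to roughly 4 GiB leaves far more headroom for the KV cache and the rest of the system.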
## Structured output with XGrammar

Please refer to the SGLang structured output documentation.
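As a minimal sketch, a schema-constrained request against the OpenAI-compatible endpoint might look like the following. It assumes the server launched earlier on the default port 30000, and the schema and field names are made up for illustration; see the structured output documentation for the authoritative request format:

```shell
# Ask for output constrained to a JSON schema (hypothetical schema for illustration)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Give the capital of France as JSON."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "capital",
        "schema": {
          "type": "object",
          "properties": {"capital": {"type": "string"}},
          "required": ["capital"]
        }
      }
    }
  }'
```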
Thanks to the support from shahizat.