# Benchmark and Profiling

## Benchmark

- Benchmark the latency of running a single static batch, either without a server or against a running one. The arguments are the same as for `launch_server.py`.
  Note that the serverless script is a simplified test without dynamic batching, so it may run out of memory for a batch size that a real server can handle. A real server splits the prefill into several batches, while this simplified script does not.
  - Without a server (no need to launch one)
    ```bash
    python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
    ```
  - With a server (launch a server with `sglang.launch_server` first, then run the following command)
    ```bash
    python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
    ```


- Benchmark offline processing. This script will start an offline engine and run the benchmark.

  ```bash
  python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
  ```

- Benchmark online serving. Launch a server with `sglang.launch_server` first, then run the following command.

  ```bash
  python3 -m sglang.bench_serving --backend sglang --num-prompts 10
  ```
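
To compare several batch sizes with the single-batch benchmark, the command above can be generated in a loop. The following is a minimal sketch, not part of SGLang: `build_sweep_commands` is a hypothetical helper, and it only uses the flags already shown in this section.

```python
import shlex


def build_sweep_commands(model_path, batch_sizes, input_len=256, output_len=32):
    """Return one bench_one_batch command (as an argv list) per batch size."""
    commands = []
    for batch_size in batch_sizes:
        commands.append([
            "python", "-m", "sglang.bench_one_batch",
            "--model-path", model_path,
            "--batch", str(batch_size),
            "--input-len", str(input_len),
            "--output-len", str(output_len),
        ])
    return commands


if __name__ == "__main__":
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    for cmd in build_sweep_commands(model, [1, 8, 32]):
        # Only print here; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building argv lists instead of shell strings avoids quoting problems when a command is later handed to `subprocess.run`.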

## Profile with PyTorch Profiler

[PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool for inspecting kernel execution time, call stacks, and kernel overlap and occupancy.

### Profile a server with `sglang.bench_serving`

```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```

Make sure `SGLANG_TORCH_PROFILER_DIR` is set on both the server and the client side; otherwise the trace file cannot be generated correctly. A reliable way to do this is to set `SGLANG_TORCH_PROFILER_DIR` in your shell's rc file (e.g. `~/.bashrc` for bash).

For more details, please refer to [Bench Serving Guide](./bench_serving.md).
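
Because a missing `SGLANG_TORCH_PROFILER_DIR` only shows up after the run, a small preflight check can be run in both the server and the client environment. This is a sketch under assumptions: `resolve_profiler_dir` is a hypothetical helper, not an SGLang API, and it only checks that the variable is set and that the directory is writable.

```python
import os
from pathlib import Path

ENV_VAR = "SGLANG_TORCH_PROFILER_DIR"


def resolve_profiler_dir(environ=None):
    """Return the trace directory as a Path, creating it if needed.

    Raises RuntimeError when the variable is unset, empty, or not writable.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(
            f"{ENV_VAR} is not set; export it before starting the server and the client"
        )
    path = Path(value)
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise RuntimeError(f"{path} is not writable")
    return path
```

Calling `resolve_profiler_dir()` with no arguments reads the current process environment, so it can be run in the same shell that will launch the server or the client.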

### Profile a server with `sglang.bench_offline_throughput`
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile

# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```

### Profile a server with `sglang.profiler`

When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.

You can do this by running `python3 -m sglang.profiler`. For example:

```bash
# Terminal 1: Send a generation request
python3 -m sglang.test.send_one

# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler
```

### Possible PyTorch bugs
If you encounter the following error (for example, when using Qwen 2.5 VL):
```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```
This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you can disable `with_stack` through an environment variable, as follows:
```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```

### View traces

Trace files can be loaded and visualized from:

1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)

If the browser cannot open a trace file because it is too large,
the client can generate a smaller trace file (<100MB) by limiting the number of prompts and the output length.
For example, when profiling a server:

```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```

This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the output length to 100 with the `--sharegpt-output-len` argument, which produces a trace file small enough for the browser to open smoothly.

Additionally, if you want to map CUDA kernels in the trace back to the SGLang Python source code, you need to disable CUDA Graph when starting the server. To do so, pass the `--disable-cuda-graph` flag to the server launch command.
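
To spot traces that are likely too large for the browser before trying to load them, the trace directory can be scanned for files above the 100MB mark mentioned above. This is a minimal sketch: `find_large_traces` is a hypothetical helper, and it makes no assumption about how the trace files are named, so it simply inspects every file under the directory.

```python
from pathlib import Path


def find_large_traces(trace_dir, limit_mb=100):
    """Return (path, size_in_mb) for every file under trace_dir larger than limit_mb."""
    limit_bytes = limit_mb * 1024 * 1024
    large = []
    for path in sorted(Path(trace_dir).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((path, size / (1024 * 1024)))
    return large
```

For example, `find_large_traces("/root/sglang/profile_log")` lists the files in the trace directory used earlier that exceed the limit; rerunning the benchmark with fewer prompts or shorter outputs should shrink them.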

## Profile with Nsight

[Nsight Systems](https://docs.nvidia.com/nsight-systems/) is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.

1. Prerequisite:

   Install using apt, or run inside an [NVIDIA Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) or [SGLang Docker container](https://github.com/sgl-project/sglang/tree/main/docker).

   ```bash
   # install nsys
   # https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
   apt update
   apt install -y --no-install-recommends gnupg
   echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
   apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
   apt update
   apt install nsight-systems-cli
   ```

2. To profile a single batch, use

   ```bash
   nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512
   ```

3. To profile a server, e.g.

   ```bash
   # launch the server, set the delay and duration times according to needs
   # after the duration time has been used up, server will be killed by nsys

   nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

   # client
   python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input-len 1024 --random-output-len 512
   ```

   In practice, we recommend setting the `--duration` argument to a large value. Whenever you want the server to stop profiling, first run:

   ```bash
   nsys sessions list
   ```

   to get the session id in the form of `profile-XXXXX`, then run:

   ```bash
   nsys stop --session=profile-XXXXX
   ```

   to manually kill the profiler and generate `nsys-rep` files instantly.

4. Use NVTX to annotate code regions, e.g. to see their execution time.

   ```bash
   # install nvtx
   pip install nvtx
   ```

   ```python
   # code snippet
   import nvtx

   with nvtx.annotate("description", color="blue"):
       pass  # replace with the critical code to be timed
   ```
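
If the annotated code also has to run on machines where `nvtx` is not installed, the annotation can be wrapped so that it degrades to a no-op. The following is a sketch under assumptions: `annotate` and `run_step` are hypothetical names for illustration, and `run_step` is a stand-in for the real critical code.

```python
import contextlib

try:
    import nvtx  # optional dependency: pip install nvtx
except ImportError:
    nvtx = None


def annotate(message, color="blue"):
    """Return an NVTX range context manager, or a no-op when nvtx is missing."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(message, color=color)


def run_step(n):
    # Hypothetical workload standing in for "some critical code".
    with annotate("run_step", color="green"):
        return sum(i * i for i in range(n))
```

When the program runs under `nsys profile` with `nvtx` installed, the `run_step` range appears in the timeline; otherwise the wrapper adds no behavior.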

## Other tips

1. You can benchmark a model using dummy weights by only providing the `config.json` file. This allows for quick testing of model variants without training. To do so, add `--load-format dummy` to the above commands; you then only need a correct `config.json` under the checkpoint folder.
2. You can benchmark a model with modified configs (e.g., fewer layers) by using `--json-model-override-args`. For example, you can benchmark a model with only 1 layer and 1 KV head using:

   ```bash
   python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
   ```

3. You can use `--python-backtrace=cuda` with `nsys profile` to see the Python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event based timing.)
4. For more arguments see [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
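
When the override from tip 2 is generated by a script, building the JSON with `json.dumps` avoids hand-written quoting mistakes. This is a minimal sketch: `override_args_flag` is a hypothetical helper, and the command only reuses flags shown in tip 2.

```python
import json
import shlex


def override_args_flag(**overrides):
    """Return argv-style tokens for --json-model-override-args."""
    return ["--json-model-override-args", json.dumps(overrides)]


cmd = [
    "python", "-m", "sglang.bench_one_batch",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--batch", "32", "--input-len", "256", "--output-len", "32",
    "--load-format", "dummy",
    *override_args_flag(num_hidden_layers=1, num_key_value_heads=1),
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The same `cmd` list can be passed directly to `subprocess.run(cmd, check=True)` without any shell quoting.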