Rename sglang.bench_latency to sglang.bench_one_batch (#2118)

Lianmin Zheng
2024-11-21 20:07:48 -08:00
committed by GitHub
parent 8048c28c11
commit dfec7fca06
16 changed files with 521 additions and 599 deletions

@@ -1,11 +1,16 @@
# Benchmark and Profiling
## Benchmark
- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, consider using `sglang.bench_serving`.
- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`.
Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
```
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
- Benchmark online serving. Launch a server first and run the following command.
- Benchmark offline processing. This script starts an offline engine and runs the benchmark (see the engine sketch after this list).
```
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
```
- Benchmark online serving. Launch a server with `sglang.launch_server` first, then run the following command.
```
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
```
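The offline benchmark above drives SGLang's offline engine rather than an HTTP server. As a rough sketch of what that looks like in user code (assuming the `sgl.Engine` entry point and dict-style sampling parameters; names may differ across versions):
```
import sglang as sgl

# Start an in-process offline engine (no HTTP server).
# Engine arguments mirror those of launch_server.py.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

# generate() runs all prompts through the engine and returns one dict per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```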
@@ -23,7 +28,7 @@ apt update
apt install nsight-systems-cli
```
1. To profile a single batch, use `nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512`
1. To profile a single batch, use `nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512`
2. To profile a server, e.g.
@@ -33,7 +38,7 @@ apt install nsight-systems-cli
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 6000 --dataset-name random --random-input 4096 --random-output 2048
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512
```
3. Use NVTX, e.g.
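   For example, a minimal sketch using the `nvtx` package from PyPI (`pip install nvtx`); the annotated region shows up as a named range in the Nsight Systems timeline, and `run_prefill` is a hypothetical stand-in for the code under profile:
```
import nvtx

# Everything inside this context manager appears as a labeled
# "prefill" range in the nsys timeline.
with nvtx.annotate("prefill", color="green"):
    run_prefill()  # hypothetical: replace with the code you want to profile
```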

@@ -59,7 +59,7 @@ For interactive debugging, you can compare the outputs of huggingface/transforme
The following two commands should give the same text output and very similar prefill logits.
- Get the reference output by `python3 scripts/playground/reference_hf.py --model [new model]`
- Get the SGLang output by `python3 -m sglang.bench_latency --correct --model [new model]`
- Get the SGLang output by `python3 -m sglang.bench_one_batch --correct --model [new model]`
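To make the comparison concrete, here is a hypothetical sketch of the huggingface/transformers side; the model id and prompt are placeholders for `[new model]` and your own test input:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder for [new model]
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    # Prefill logits have shape [1, seq_len, vocab_size]; these are what
    # should be very close between transformers and SGLang.
    logits = model(**inputs).logits
print(logits[0, -1].topk(5))  # top-5 next-token candidates
```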
#### Add the model to the test suite
To make sure the new model remains well maintained in the future, it is best to add it to the test suite.

@@ -59,7 +59,7 @@ drun -p 30000:30000 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Until the flashinfer backend is available, --attention-backend triton and --sampling-backend pytorch are set by default
drun v0.3.5.post2-rocm620 python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
drun v0.3.5.post2-rocm620 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```
## Method 4: Using docker compose