Fix docs (#1889)

2024-11-02 11:46:00 -07:00
parent 3b60558dd7
commit 7b394e5f2b
6 changed files with 87 additions and 265 deletions
--- a/docs/references/sampling_params.md
+++ b/docs/references/sampling_params.md
@@ -177,252 +177,3 @@ print(response.json())

 The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
 Streaming is supported in a similar manner as [above](#streaming).
-
-## Performance Implications on Penalties
-
-While you can apply penalties by supplying relevant `sampling_params`, this comes with some drawbacks.
-
-These drawbacks affect every request in the same batch, because penalizers are applied to the whole batch at once.
-
-### Latency
-
-Although the penalty algorithms are computed with CUDA, they are still additional computation on top of the basic sampling logic. For detailed overhead figures, we recommend running your own benchmarks; the samples below give a rough idea.
-
-### Memory
-
-Because the penalty algorithms are computed with CUDA, their parameters are stored on the GPU. The memory used is usually on the order of `vocab_size` multiplied by `running_requests`.
-
-Run your own benchmark with the desired parameters on your own hardware to make sure it does not run out of memory (OOM) before you rely on penalties.
-
-Tuning `--mem-fraction-static` and/or `--max-running-requests` can help.
-
-### Benchmarks
-
-All the benchmarks below were run on an NVIDIA H100 SXM5.
-
-<details>
-
-#### Baseline
-
-Measured at [dc9d06d886151707f97d0b78095df9de262fd3c9](https://github.com/sgl-project/sglang/commit/dc9d06d886151707f97d0b78095df9de262fd3c9).
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  66.11
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    775118
-Request throughput (req/s):              45.38
-Input token throughput (tok/s):          5727.04
-Output token throughput (tok/s):         11732.16
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   40881.94
-Median E2E Latency (ms):                 43967.10
---------------Time to First Token----------------
-Mean TTFT (ms):                          19884.75
-Median TTFT (ms):                        14226.56
-P99 TTFT (ms):                           47738.97
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          91.96
-Median TPOT (ms):                        90.11
-P99 TPOT (ms):                           308.54
---------------Inter-token Latency----------------
-Mean ITL (ms):                           174.54
-Median ITL (ms):                         58.56
-P99 ITL (ms):                            440.18
-==================================================
-```
-
-#### All Together
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-  "frequency_penalty": 1.1,
-  "presence_penalty": 1.1,
-  "repetition_penalty": 0.1,
-  "min_new_tokens": 5
-}'
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  78.35
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    774756
-Request throughput (req/s):              38.29
-Input token throughput (tok/s):          4832.86
-Output token throughput (tok/s):         9900.39
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   49017.68
-Median E2E Latency (ms):                 52825.70
---------------Time to First Token----------------
-Mean TTFT (ms):                          23892.60
-Median TTFT (ms):                        18895.47
-P99 TTFT (ms):                           57426.01
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          114.54
-Median TPOT (ms):                        107.27
-P99 TPOT (ms):                           293.31
---------------Inter-token Latency----------------
-Mean ITL (ms):                           205.68
-Median ITL (ms):                         73.97
-P99 ITL (ms):                            453.86
-==================================================
-```
-
-#### Frequency Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-    "frequency_penalty": 1.1
-}'
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  72.72
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    774955
-Request throughput (req/s):              41.26
-Input token throughput (tok/s):          5206.84
-Output token throughput (tok/s):         10666.51
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   45445.56
-Median E2E Latency (ms):                 48960.39
---------------Time to First Token----------------
-Mean TTFT (ms):                          22363.16
-Median TTFT (ms):                        17125.02
-P99 TTFT (ms):                           52920.95
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          104.71
-Median TPOT (ms):                        98.30
-P99 TPOT (ms):                           268.06
---------------Inter-token Latency----------------
-Mean ITL (ms):                           191.60
-Median ITL (ms):                         67.83
-P99 ITL (ms):                            455.46
-==================================================
-```
-
-#### Presence Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-    "presence_penalty": 1.1
-}'
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  72.04
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    775210
-Request throughput (req/s):              41.64
-Input token throughput (tok/s):          5255.98
-Output token throughput (tok/s):         10767.18
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   44926.61
-Median E2E Latency (ms):                 48302.88
---------------Time to First Token----------------
-Mean TTFT (ms):                          22095.39
-Median TTFT (ms):                        16740.93
-P99 TTFT (ms):                           52554.03
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          103.54
-Median TPOT (ms):                        97.37
-P99 TPOT (ms):                           271.86
---------------Inter-token Latency----------------
-Mean ITL (ms):                           189.86
-Median ITL (ms):                         68.45
-P99 ITL (ms):                            447.11
-==================================================
-```
-
-#### Repetition Penalty
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-    "repetition_penalty": 0.1
-}'
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  74.54
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    766008
-Request throughput (req/s):              40.24
-Input token throughput (tok/s):          5079.36
-Output token throughput (tok/s):         10405.35
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   46530.38
-Median E2E Latency (ms):                 50302.65
---------------Time to First Token----------------
-Mean TTFT (ms):                          22603.47
-Median TTFT (ms):                        17167.08
-P99 TTFT (ms):                           54497.85
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          117.59
-Median TPOT (ms):                        101.79
-P99 TPOT (ms):                           320.04
---------------Inter-token Latency----------------
-Mean ITL (ms):                           195.26
-Median ITL (ms):                         69.51
-P99 ITL (ms):                            433.86
-==================================================
-```
-
-#### Min New Tokens
-
-The min new tokens penalizer keeps computing until the generation process reaches the given `min_new_tokens`.
-
-Unlike the other penalizers, setting this to a higher value has a larger latency impact.
-
-```
-$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
-    "min_new_tokens": 5
-}'
-
-============ Serving Benchmark Result ============
-Backend:                                 sglang
-Traffic request rate:                    inf
-Successful requests:                     3000
-Benchmark duration (s):                  66.94
-Total input tokens:                      378633
-Total generated tokens:                  775651
-Total generated tokens (retokenized):    775220
-Request throughput (req/s):              44.81
-Input token throughput (tok/s):          5656.13
-Output token throughput (tok/s):         11586.90
----------------End-to-End Latency----------------
-Mean E2E Latency (ms):                   41888.55
-Median E2E Latency (ms):                 45354.16
---------------Time to First Token----------------
-Mean TTFT (ms):                          20866.91
-Median TTFT (ms):                        16219.79
-P99 TTFT (ms):                           49263.91
-----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          97.05
-Median TPOT (ms):                        89.76
-P99 TPOT (ms):                           233.50
---------------Inter-token Latency----------------
-Mean ITL (ms):                           179.17
-Median ITL (ms):                         55.08
-P99 ITL (ms):                            409.12
-==================================================
-```
-
-</details>