sglang/test/srt/test_bench_one_batch.py

import os
import unittest

from sglang.test.test_utils import (
    DEFAULT_MODEL_NAME_FOR_TEST,
    DEFAULT_MOE_MODEL_NAME_FOR_TEST,
    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
    CustomTestCase,
    is_in_amd_ci,
    is_in_ci,
    run_bench_offline_throughput,
    run_bench_one_batch,
    write_github_step_summary,
)

# We use `run_bench_offline_throughput` instead of `run_bench_one_batch` for most cases
# because `run_bench_offline_throughput` has the overlap scheduler.


class TestBenchOneBatch(CustomTestCase):

    def test_bs1_small(self):
        _, output_throughput, _ = run_bench_one_batch(
            DEFAULT_SMALL_MODEL_NAME_FOR_TEST, ["--cuda-graph-max-bs", "2"]
        )
        self.assertGreater(output_throughput, 50)

    def test_bs1_default(self):
        output_throughput = run_bench_offline_throughput(
            DEFAULT_MODEL_NAME_FOR_TEST, ["--cuda-graph-max-bs", "2"]
        )

        if is_in_ci():
            write_github_step_summary(
                f"### test_bs1_default (llama-3.1-8b)\n"
                f"output_throughput: {output_throughput:.2f} token/s\n"
            )
            self.assertGreater(output_throughput, 135)

    def test_moe_tp2_bs1(self):
        output_throughput = run_bench_offline_throughput(
            DEFAULT_MOE_MODEL_NAME_FOR_TEST, ["--tp", "2", "--cuda-graph-max-bs", "2"]
        )

        if is_in_ci():
            write_github_step_summary(
                f"### test_moe_tp2_bs1 (Mixtral-8x7B)\n"
                f"output_throughput: {output_throughput:.2f} token/s\n"
            )
            if is_in_amd_ci():
                self.assertGreater(output_throughput, 85)
            else:
                self.assertGreater(output_throughput, 125)

    def test_torch_compile_tp2_bs1(self):
        output_throughput = run_bench_offline_throughput(
            DEFAULT_MODEL_NAME_FOR_TEST,
            ["--tp", "2", "--enable-torch-compile", "--cuda-graph-max-bs", "2"],
        )

        if is_in_ci():
            write_github_step_summary(
                f"### test_torch_compile_tp2_bs1 (llama-3.1-8b)\n"
                f"output_throughput: {output_throughput:.2f} token/s\n"
            )
            if is_in_amd_ci():
                self.assertGreater(output_throughput, 200)
            else:
                self.assertGreater(output_throughput, 220)


if __name__ == "__main__":
    unittest.main()