39 Commits

Author SHA1 Message Date
Lzhang-hub
4efe2c57c9 support vlm model spec bench (#10173) 2025-09-10 13:37:04 +08:00
Chayenne
9b08d975a0 [docs] Refactor, remove compiled results and add gpt-oss (#9613)
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
2025-08-25 15:27:06 -07:00
Lianmin Zheng
c480a3f6ea Minor style fixes for sgl-kernel (#9289) 2025-08-18 09:38:35 -07:00
Kay Yan
975a5ec69c [fix] update bench_speculative.py for compatibility (#7764)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
2025-07-04 16:32:54 +08:00
Lianmin Zheng
22352d47a9 Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-06-29 23:16:19 -07:00
Lianmin Zheng
0f218731e3 Do not run frontend_reasoning.ipynb to reduce the CI load (#7073) 2025-06-10 17:15:31 -07:00
fzyzcjy
25be63d0b2 Auto handle PD disaggregation in bench_serving (#6587)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-25 22:41:27 -07:00
Byron Hsu
2d831c6ef9 [PD] Support structured output (#6560) 2025-05-23 21:49:00 -07:00
Byron Hsu
8233cc10fd [PD] Support logprob & Add failure test (#6558) 2025-05-23 14:29:20 -07:00
Yineng Zhang
eabcf82acb feat: add long context example (#6391) 2025-05-18 01:45:17 -07:00
Yineng Zhang
7282ab741a fix: update bench_speculative (#5649) 2025-04-22 16:08:15 -07:00
Byron Hsu
bf98d2e377 [PD] Support prefill overlap + Ensure no race condition (#5609) 2025-04-21 12:12:56 -07:00
Byron Hsu
deded17f38 [PD] Fix edge case and simplify large page size + chunked prefill (#5589) 2025-04-21 10:27:02 -07:00
Byron Hsu
c951d312ed [PD] Fix large page size + chunk prefill (#5588) 2025-04-20 17:21:54 -07:00
Baizhou Zhang
6fb29ffd9e Deprecate enable-flashinfer-mla and enable-flashmla (#5480) 2025-04-17 01:43:33 -07:00
lukec
a53fe428f9 Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-16 09:07:06 -07:00
Ke Bao
f1d09a6541 Update bench speculative script (#4235) 2025-03-09 12:19:01 -07:00
Adarsh Shirawalmath
19fd57bcd7 [docs] fix HF reference script command (#4148) 2025-03-06 13:21:54 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Yineng Zhang
bc6ad367c2 fix lint (#2733) 2025-01-05 14:45:42 +08:00
Ce Gao
f5d0865b25 feat: Support VLM in reference_hf (#2726)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:32:30 +08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Byron Hsu
30af7dfb34 [router] add base_gpu_id server args & merged radix tree python reference (#2115) 2024-11-21 17:13:33 -08:00
Lianmin Zheng
56a347f7d3 Move test_session_id.py to playground (#2104) 2024-11-20 01:28:27 -08:00
Ke Bao
62832bb272 Support cuda graph for DP attention (#2061) 2024-11-17 16:29:20 -08:00
Chayenne
c77c1e05ba fix black in pre-commit (#1940) 2024-11-08 07:42:47 +08:00
Xuehai Pan
a5e0defb5a minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926) 2024-11-06 13:46:04 +00:00
Jani Monoses
916b3cdddc Allow passing dtype and max_new_tokens to HF reference script (#1903) 2024-11-03 08:24:37 -08:00
Ying Sheng
c5325aba75 [Profile] Add pytorch profiler (#1604) 2024-10-07 14:37:16 -07:00
Lianmin Zheng
fb2d0680e0 [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) 2024-09-24 21:37:33 -07:00
Lianmin Zheng
2854a5ea9f Fix the overhead due to penalizer in bench_latency (#1496) 2024-09-23 07:38:14 -07:00
Lianmin Zheng
167591e864 Better unit tests for adding a new model (#1488) 2024-09-22 01:50:37 -07:00
Ying Sheng
37963394aa [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) 2024-09-15 12:46:04 -07:00
Ying Sheng
712216928f [Feature] Initial support for multi-LoRA serving (#1307) 2024-09-12 16:46:14 -07:00
Lianmin Zheng
f64eae3a29 [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) 2024-09-02 21:44:45 -07:00
Ying Sheng
0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) 2024-08-13 17:01:26 -07:00
Ying Sheng
4075677621 Add OpenAI backend to the CI test (#869) 2024-08-01 09:25:24 -07:00