28 Commits

Author SHA1 Message Date
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
Lifu Huang
3cf1473a09 Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-17 16:49:18 -07:00
Fr4nk1in
4bd2952a37 feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 (#6121)
Co-authored-by: King.Zevin <zevin@mail.ustc.edu.cn>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-05-16 14:44:10 -07:00
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-12 00:17:33 -07:00
fzyzcjy
b6cf3532b5 Tiny refactor ModelConfig.from_server_args (#5219) 2025-05-08 01:02:43 -07:00
Ying Sheng
11383cec3c [PP] Add pipeline parallelism (#5724) 2025-04-30 18:18:07 -07:00
Lianmin Zheng
621e96bf9b [CI] Fix ci tests (#5769) 2025-04-27 07:18:10 -07:00
JieXin Liang
f55933e1cc [misc] more decode step log for batch_one_batch (#5565) 2025-04-26 19:50:28 -07:00
Baizhou Zhang
b54b5a96e4 [Doc]Add instruction for profiling with bench_one_batch (#5581) 2025-04-20 14:05:36 -07:00
fzyzcjy
d07e797ace Fix bench_one_batch producing unnatural results for expert parallel (#5149) 2025-04-20 00:38:04 -07:00
Cheng Wan
038bc5d521 Support --enable-llama4-multimodal (#5254) 2025-04-11 01:24:14 -07:00
fzyzcjy
61970b08d8 Let bench_one_batch support enable_dp_attention (#4058) 2025-04-08 23:44:25 -07:00
fzyzcjy
f01b092519 Super tiny fix typo (#4738) 2025-03-24 21:05:45 -07:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Hubert Lu
f8b28e461a Add CPU affinity setting to latency benchmark (#3085) 2025-01-25 23:52:05 -08:00
Lianmin Zheng
3d8f1c9bcf Use int64 as indices for set_kv_buffer (#3039) 2025-01-21 19:46:09 -08:00
Hongpeng Guo
583697cd71 [Enhancement] Custom Logit Processor Improvement (#2998)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-20 02:00:35 -08:00
Lianmin Zheng
03464890e0 Separate two entry points: Engine and HTTP server (#2996)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-01-19 22:09:24 -08:00
Yun Dai
e00e5385e0 add profiling to bench_one_batch script (#2821) 2025-01-16 07:24:24 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
Lianmin Zheng
23e5e50fd5 Fix gemlite import (#2553) 2024-12-22 20:21:17 -08:00
Jerry Zhang
feb2b768ba Add integration with gemlite weight only quant (#2528) 2024-12-21 00:25:25 +08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318)
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-02 23:22:13 +08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fed4c6946a Release v0.3.6.post2 (#2214)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-11-27 03:35:30 -08:00
Lianmin Zheng
5652c56535 Update CI threshold & Improve code style (#2159) 2024-11-24 06:29:38 -08:00
Ankur Neog
865233e256 Add initial support for intel Gaudi accelerators (#2121) 2024-11-22 20:22:23 -08:00
Lianmin Zheng
dfec7fca06 Rename sglang.bench_latency to sglang.bench_one_batch (#2118) 2024-11-21 20:07:48 -08:00