Commit Graph

168 Commits

Author SHA1 Message Date
Mick
583d6af71b example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-04 22:18:26 -08:00
Lianmin Zheng
77a3954bf7 Simplify eagle tests and TP sync in grammar backend (#4066) 2025-03-04 13:40:40 -08:00
Xihuai Wang
95575aa76a Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
2025-03-03 21:16:36 -08:00
Qiaolin Yu
57a404fd55 Remove outdated test utils and fix links for the doc of sampling params (#3999) 2025-03-03 09:41:38 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
1a8f995c46 remove cache configs in model definitions (#4031) 2025-03-03 05:00:50 -08:00
Lianmin Zheng
66301e124f Improve code styles (#4021) 2025-03-03 03:20:23 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
fzyzcjy
e3e0bc50a9 [Feature] SPMD for SGLang + Verl (#3852) 2025-02-28 09:53:10 -08:00
lukec
21463e321a Expert Parallelism (EP) Support for DeepSeek V3/R1 (#3602)
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: laixin <q865809639@gmail.com>
2025-02-26 02:29:37 -08:00
Shenggui Li
c0bb9eb3b3 [improve] made timeout configurable (#3803) 2025-02-25 00:26:08 -08:00
aoshen524
e79f7420be [Fix] Fix bugs and refactor codes in lora for better scalability. (#3652)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-02-20 11:51:57 -08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Lianmin Zheng
53cef81587 Improve weight loading and code style (#3174) 2025-01-27 03:00:41 -08:00
Lianmin Zheng
a4331cd260 Add accuracy and latency tests of eagle into CI (#3027) 2025-01-21 02:55:14 -08:00
Lianmin Zheng
287d07a669 Misc fixes for eagle (flush_cache, CPU overhead) (#3014) 2025-01-20 20:27:38 -08:00
Lianmin Zheng
60b2a44a80 Fix flaky tests in test_programs.py (#3022) 2025-01-20 16:50:39 -08:00
Lianmin Zheng
03464890e0 Separate two entry points: Engine and HTTP server (#2996)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-01-19 22:09:24 -08:00
Lianmin Zheng
61f42b5732 Move sgl.Runtime under sglang/lang (#2990) 2025-01-19 17:10:29 -08:00
Mick
3d93f84a00 [Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-01-18 14:14:19 -08:00
bjmsong
d3024f4fc8 support e4m3 kvcache in qwen2 & add kv scaling facotr json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-18 11:43:22 +08:00
Lianmin Zheng
8f2c522aba Improve benchmark scripts and error message printing (#2922) 2025-01-16 06:24:31 -08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
Lianmin Zheng
b22f3f6475 Fix nightly accuracy tests (#2780) 2025-01-07 21:02:35 -08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
Xingyao Wang
1acbaf1b5a Add generator-style run_batch function (#2513)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-06 15:04:55 -08:00
HandH1998
53aed988cb Refactor MoE (#2575)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-12-26 00:02:14 +08:00
Lianmin Zheng
a6ca736c8e Simplify stream_output (#2398) 2024-12-08 12:27:13 -08:00
Lianmin Zheng
a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) 2024-12-08 03:55:27 -08:00
Lianmin Zheng
ccaf1f997c [CI] Print summary on github actions (#2274) 2024-11-29 23:48:54 -08:00
Chayenne
7d5d1d3d29 udate weights from disk (#2265) 2024-11-30 01:17:00 +00:00
bjmsong
01017d4c20 Support LoRA in Completion API (#2243)
Co-authored-by: root <bjmsong@126.com>
2024-11-29 16:13:38 -08:00
Lianmin Zheng
b2ccf36d4d Fix memory leak during abort (#2238) 2024-11-28 02:22:15 -08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fb6e04a0c2 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222) 2024-11-27 02:52:46 -08:00
Lianmin Zheng
6997e28f6e Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221) 2024-11-27 02:02:01 -08:00
Lianmin Zheng
a0e58740a8 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217) 2024-11-27 01:13:41 -08:00
Lianmin Zheng
8e1adb8441 Allow overwrite flashinfer use_tensorcore (#2169) 2024-11-24 20:58:17 -08:00
Lianmin Zheng
c211e7b669 Simplify batch update (#2154) 2024-11-24 04:47:10 -08:00
Byron Hsu
cbedd1db1d [router] cache-aware load-balancing router v1 (#2114) 2024-11-23 08:34:48 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Yineng Zhang
4f8c3aeafc minor: update gsm8k threshold (#2125) 2024-11-22 19:23:58 +08:00
bjmsong
ad30d5cf9a Benchmark with Pytorch Profiler easily (#2110)
Co-authored-by: root <bjmsong@126.com>
2024-11-21 23:29:50 -08:00
Lianmin Zheng
dfec7fca06 Rename sglang.bench_latency to sglang.bench_one_batch (#2118) 2024-11-21 20:07:48 -08:00
James Xu
f6f713797b Add support for Qwen2-VL-based embedding models (#2055) 2024-11-21 14:24:25 -08:00
Lianmin Zheng
7d671e4ad2 Enable overlap by default (#2067) 2024-11-19 22:07:58 -08:00
Lianmin Zheng
b110453802 Simplify logits penalizer (#2086) 2024-11-18 17:48:28 -08:00
Yineng Zhang
766192610e feat: update torch 2.5.1 (#2069) 2024-11-18 21:29:13 +08:00
ws
29ebe3dff4 fix: align enable_overlap_scheduler naming between code and docs (#2038) 2024-11-15 03:39:10 -08:00
Lianmin Zheng
aae5434bdf Fix unit tests (#2034) 2024-11-14 11:08:37 -08:00