Commit Graph

132 Commits

Author SHA1 Message Date
Lianmin Zheng
751e5ca273 [minor] clean up docs and eos id (#2622) 2024-12-27 11:23:46 -08:00
Yang Zheng
7a7ac6bea1 [FIX] Update EOS from config (#2475) 2024-12-27 10:59:56 -08:00
fzyzcjy
3169e66c23 Fix duplicated handling of GetWeightsByNameReqInput (#2565) 2024-12-26 06:49:32 -08:00
Lianmin Zheng
773951548d Fix logprob_start_len for multi modal models (#2597) 2024-12-26 06:27:45 -08:00
Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
Adarsh Shirawalmath
acb340728c [Feature] Support new parameter - EBNF in xgrammar (#2526) 2024-12-26 05:12:41 -08:00
Liangsheng Yin
e7ebecf82e Fix cache hit rate when chunked prefill (#2555) 2024-12-26 03:14:28 -08:00
Lianmin Zheng
8496701934 [Misc] Fix metrics, weight update lock, request logging (#2543) 2024-12-22 06:27:22 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Lianmin Zheng
0ce091a82d [Minor] Improve code style (#2419) 2024-12-09 03:05:59 -08:00
Lianmin Zheng
a6ca736c8e Simplify stream_output (#2398) 2024-12-08 12:27:13 -08:00
Lianmin Zheng
cc858953a0 Fix recv_requests (#2405) 2024-12-08 04:08:04 -08:00
Lianmin Zheng
a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) 2024-12-08 03:55:27 -08:00
SangBin Cho
1f09e84b9a nit: Remove busy waiting on scheduler (#2382) 2024-12-08 01:06:15 -08:00
Lianmin Zheng
0e7409adb6 Fix the overlap for xgrammar (#2377) 2024-12-06 05:49:29 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318) 2024-12-02 23:22:13 +08:00
Co-authored-by: HandH1998 <1335248067@qq.com>
Lianmin Zheng
3c79ad35ca [Fix] Fix the padded hash value for image tokens (#2309) 2024-12-01 23:36:28 -08:00
Chayenne
983bfcf386 Online weight updates from torch.distributed (#2279) 2024-12-01 23:23:18 -08:00
Liangsheng Yin
5f12f0e7af Fix chunked prefill when ignore eos (#2290) 2024-12-01 00:37:53 -08:00
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Chayenne
7d5d1d3d29 Update weights from disk (#2265) 2024-11-30 01:17:00 +00:00
Lianmin Zheng
94e167ea5a Fix the default chunked prefill size (#2268) 2024-11-29 16:03:32 -08:00
Lianmin Zheng
afe1e46586 [Minor] fix the style for multimodal models (#2257) 2024-11-29 04:24:20 -08:00
Lianmin Zheng
f50a6cf443 Fix hash collision for multi modal models (#2256) 2024-11-29 03:15:58 -08:00
Lianmin Zheng
fe97a2d40f Simplify tokenizer manager (#2254) 2024-11-29 02:18:51 -08:00
Lianmin Zheng
b2ccf36d4d Fix memory leak during abort (#2238) 2024-11-28 02:22:15 -08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fb915bd1a2 Disable overlap scheduler for multimodal models (#2235) 2024-11-27 23:44:33 -08:00
Lianmin Zheng
2a02185c5f Rename DP_RANK to SGLANG_DP_RANK (#2218) 2024-11-27 09:36:36 -08:00
Lianmin Zheng
fb6e04a0c2 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222) 2024-11-27 02:52:46 -08:00
Lianmin Zheng
6997e28f6e Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221) 2024-11-27 02:02:01 -08:00
Lianmin Zheng
a0e58740a8 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217) 2024-11-27 01:13:41 -08:00
Ying Sheng
37c8a5761f [feat] Support session control for vision language models (#2210) 2024-11-27 00:03:29 -08:00
Lianmin Zheng
1605ae121e [CI] Minor fix for CI (#2187) 2024-11-25 16:38:43 -08:00
Rin Intachuen
1aea19f64b Input_embeds support (#2052) 2024-11-25 16:35:04 -08:00
HAI
10189d08dd [Performance]: Process affinity to CPU cores with multiple sockets support (#2171) 2024-11-25 14:57:32 -08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Lianmin Zheng
8e1adb8441 Allow overwrite flashinfer use_tensorcore (#2169) 2024-11-24 20:58:17 -08:00
Lianmin Zheng
731146f6cb Fix mixed chunked prefill in overlap mode (#2158) 2024-11-24 07:17:37 -08:00
Lianmin Zheng
5652c56535 Update CI threshold & Improve code style (#2159) 2024-11-24 06:29:38 -08:00
Lianmin Zheng
c211e7b669 Simplify batch update (#2154) 2024-11-24 04:47:10 -08:00
Byron Hsu
52f58fc42a fix dp_rank env (#2144) 2024-11-23 11:46:21 -08:00
Lianmin Zheng
751c3a037c Fix dp print message (#2138) 2024-11-23 01:22:26 -08:00
Lianmin Zheng
66d4859acf Revert "Only stream output on tp rank 0" (#2130) 2024-11-22 15:46:16 -08:00
Lianmin Zheng
e1b63624d7 Only stream output on tp rank 0 (#2124) 2024-11-22 15:13:44 -08:00
Henry Hyeonmok Ko
c35cd1f8c7 Expose max total num tokens from Runtime & Engine API (#2092) 2024-11-22 15:10:10 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Jake Poznanski
8048c28c11 Fix #2037 - Context length check does not take into account pad tokens for visual models (#2106) 2024-11-21 19:05:41 -08:00
Byron Hsu
30af7dfb34 [router] add base_gpu_id server args & merged radix tree python reference (#2115) 2024-11-21 17:13:33 -08:00
Lianmin Zheng
722530fa01 Enable overlap scheduler by default for the triton attention backend (#2105) 2024-11-20 02:58:35 -08:00