sglang

Author	SHA1	Message	Date
Ying Sheng	8b48496aaf	Revert "Revert "Add simple CPU offloading support"" (#2253 ) Co-authored-by: Jani Monoses <jani.monoses@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2024-11-28 23:58:54 -08:00
Ying Sheng	4057ea82c9	Revert "Add simple CPU offloading support" (#2252 ) We'll re-add the commit to correctly ack Kaichao's authorship	2024-11-28 23:36:55 -08:00
Lianmin Zheng	09798b36cd	Fix chunked prefill size for bench_offline_throughput (#2234 )	2024-11-27 23:37:20 -08:00
Lianmin Zheng	731146f6cb	Fix mixed chunked prefill in overlap mode (#2158 )	2024-11-24 07:17:37 -08:00
Jani Monoses	d98fa1e93d	Add simple CPU offloading support. (#2081 )	2024-11-23 06:23:53 +00:00
Ankur Neog	865233e256	Add initial support for intel Gaudi accelerators (#2121 )	2024-11-22 20:22:23 -08:00
Xuehai Pan	62a4a339eb	docs: fix module docstrings and copyright headers (#2077 )	2024-11-22 22:16:53 +08:00
Byron Hsu	30af7dfb34	[router] add base_gpu_id server args & merged radix tree python reference (#2115 )	2024-11-21 17:13:33 -08:00
Jerry Zhang	5c6a41facf	Error out when torchao-config option is not recognized (#2107 )	2024-11-20 17:37:28 -08:00
Lianmin Zheng	722530fa01	Enable overlap scheduler by default for the triton attention backend (#2105 )	2024-11-20 02:58:35 -08:00
Lianmin Zheng	7d671e4ad2	Enable overlap by default (#2067 )	2024-11-19 22:07:58 -08:00
Ke Bao	699384cb01	Set schedule policy more conservative for DP attention (#2096 )	2024-11-19 20:57:18 -08:00
Lianmin Zheng	ffd20fcd03	Make constrained decoding work for overlap scheduler (#2095 )	2024-11-19 15:04:43 -08:00
Lianmin Zheng	b110453802	Simplify logits penalizer (#2086 )	2024-11-18 17:48:28 -08:00
Lianmin Zheng	ebaa2f3199	Rename arguments `--disable-nan-detection` to `--enable-nan-detection` (#2066 )	2024-11-17 16:53:44 -08:00
Ke Bao	62832bb272	Support cuda graph for DP attention (#2061 )	2024-11-17 16:29:20 -08:00
Lianmin Zheng	11f881d173	Deprecate --disable-flashinfer and --disable-flashinfer-sampling (#2065 )	2024-11-17 16:20:58 -08:00
Lianmin Zheng	f719d9aebc	Launch dp ranks in parallel (#2053 ) Co-authored-by: Haotian Liu <6631389+haotian-liu@users.noreply.github.com>	2024-11-16 17:39:39 -08:00
Ke Bao	976bc302e5	Support DP MLA (#1970 )	2024-11-16 09:01:43 +00:00
HAI	2ffe0a7363	Add get_amdgpu_memory_capacity() (#2049 )	2024-11-15 22:51:48 -08:00
Lianmin Zheng	b01df48cf2	[Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity (#2044 )	2024-11-15 06:21:57 -08:00
Patrick Yi	13ce3e4b5d	Add download_dir ServerArgs property (#2027 )	2024-11-13 23:26:56 -08:00
Lianmin Zheng	ba069a24d3	Fix grammar backend (#2018 )	2024-11-12 21:17:38 -08:00
Lianmin Zheng	a509552087	[minor] Improve code style and compatibility (#1961 )	2024-11-08 02:19:41 -08:00
Chayenne	c77c1e05ba	fix black in pre-commit (#1940 )	2024-11-08 07:42:47 +08:00
Lzhang-hub	a146d9990e	support prometheus metrics (#1853 ) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>	2024-11-05 20:42:53 -08:00
Byron Hsu	a7a0a6886b	Make decode log interval configurable (#1847 )	2024-10-30 19:59:20 -07:00
Lianmin Zheng	86fc0d79d0	Add a watch dog thread (#1816 )	2024-10-27 02:00:50 -07:00
Lianmin Zheng	86e0dde555	Improve the user control of new_token_ratio (#1811 )	2024-10-26 16:39:41 -07:00
Lianmin Zheng	2b80978859	Provide an argument to set the maximum batch size for cuda graph (#1809 )	2024-10-26 15:09:33 -07:00
Liangsheng Yin	07bf2e846a	Allow consecutive ports when launching multiple sglang servers. (#1802 )	2024-10-26 06:43:24 +00:00
DarkSharpness	b77a02cdfd	[Performance] Support both xgrammar and outlines for constrained decoding (#1752 )	2024-10-25 21:47:02 +00:00
Lianmin Zheng	b121bc03a3	Simplify batch result resolution (#1735 )	2024-10-20 19:47:14 -07:00
Lianmin Zheng	f0f8a7699b	Simplify the nan detection and greedy check in sampler (#1709 )	2024-10-18 20:21:24 -07:00
havetc	ecb8bad276	Returning a per request metric for number of cached_tokens read (#1599 )	2024-10-16 11:49:22 -07:00
Lianmin Zheng	9116b2896f	Add a new event loop (#1677 )	2024-10-16 01:33:20 -07:00
Shuo Yang	061e546313	Support double sparsity (#1459 )	2024-10-14 02:00:41 -07:00
Lianmin Zheng	7ee6c259ff	Simplify the event loop and expose `--num-continuous-decode-steps` as an argument (#1652 )	2024-10-12 21:35:30 -07:00
Lianmin Zheng	9da5a60b18	Add an option to disable penalizer (#1651 )	2024-10-12 17:53:23 -07:00
Zhang, Liangang	5d638c92f5	[Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480 )	2024-10-12 18:10:32 +00:00
Lianmin Zheng	23cc66f7b6	Add back data parallelism (#1635 )	2024-10-11 07:22:48 -07:00
glen-amd	58093b868f	Nit about the decorator of `PortArgs.init_new` (#1611 )	2024-10-11 02:17:47 -07:00
Zhang, Liangang	8275049ce3	Add device support (#1607 )	2024-10-11 02:05:58 -07:00
Jani Monoses	3ff641132e	Remove references to squeezellm (#1603 )	2024-10-07 11:30:41 -07:00
Lianmin Zheng	6a5b352aaf	Use is_flashinfer_available to replace is_hip for flashinfer check (#1596 ) Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>	2024-10-06 22:54:05 -07:00
Lianmin Zheng	114bbc8651	Use ipc instead of tcp in zmq (#1566 )	2024-10-04 00:45:52 -07:00
Lianmin Zheng	32eb6e96f2	Organize sampling batch info better (#1562 )	2024-10-03 18:29:49 -07:00
Ying Sheng	0f4fb19bc8	[Fix, LoRA] fix LoRA with updates in main (#1545 )	2024-09-30 10:06:08 -07:00
Xinyu Yang	acaffd233f	[Fix] fix ipv6 url when warm up model (#1537 )	2024-09-29 11:02:40 -07:00
Lianmin Zheng	048685430d	Improve process creation (#1534 )	2024-09-29 02:36:12 -07:00

149 Commits