sglang

Author	SHA1	Message	Date
ybyang	e9fc2ac7b6	[PD Bug] fix MLA get_contiguous_buf_infos error (#5384 )	2025-04-14 22:56:39 +08:00
huangtingwei	5fbafbb8f8	fix MLATokenToKVPoolHost get_size_per_token bug (#5161 ) Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>	2025-04-13 12:37:26 -07:00
Teng Ma	7e4f72dd8c	[PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (#5204 )	2025-04-10 20:05:34 +08:00
Zhiqiang Xie	3fadc64793	bug fix for hicache host eviction (#4989 )	2025-04-02 00:33:50 -07:00
Zhiqiang Xie	e119f04215	Large page size aligned hierarchical caching (#4581 )	2025-04-01 22:38:15 -07:00
Lianmin Zheng	b26bc86b36	Support page size > 1 + eagle (#4908 )	2025-03-30 00:46:23 -07:00
c1lovez1	93cf7fc5cd	Unify variable naming: replace is_in_free_group with is_not_in_free_group (#4698 )	2025-03-23 21:51:08 -07:00
Zhiqiang Xie	4d25305700	Move mem_state update into debug mode (#4525 )	2025-03-23 00:52:27 -07:00
Byron Hsu	c7c7dbebbe	[PD] Release initial code (#4654 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Ying1123 <sqy1415@gmail.com> Co-authored-by: merrymercy <lianminzheng@gmail.com> Co-authored-by: makro Co-authored-by: dhou-xai	2025-03-21 14:47:47 -07:00
Zhiqiang Xie	a98290aea3	Unit test for Hierarchical Caching (#4486 )	2025-03-17 17:45:00 -07:00
Zhiqiang Xie	f5bbf6037d	Fix: Complete int32 to int64 conversion (#4465 )	2025-03-16 18:14:27 -07:00
JieXin Liang	1a3fa75f2f	[Fix] use `torch.cat` instead of `torch.concat` to prevent entering the `Autograd` backends. (#4466 )	2025-03-16 00:02:47 -07:00
Lianmin Zheng	2c4f5ccac1	Fix minor style (#4460 )	2025-03-15 21:51:12 -07:00
Wang Ran (汪然)	158430473e	Fix typos (#4368 )	2025-03-15 21:27:58 -07:00
Chen Shengzhi	86d9baedc2	[Fix] Fix errors when using the device except cuda. (#4455 )	2025-03-15 16:33:00 -07:00
Lu Changqi	0e0ec70200	Hierarchical Caching supports MLA (#4009 ) Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-03-13 20:42:14 -07:00
Zhiqiang Xie	fbdb50501f	Hot fix for hicache with new page aligned radixtree (#4397 )	2025-03-13 15:50:49 -07:00
Lianmin Zheng	a5a892ffd3	Fix auto merge & add back get_flat_data_by_layer (#4393 )	2025-03-13 08:46:25 -07:00
Lianmin Zheng	4fea040ca1	Fix a regression introduced by overlapping KV cache writing (#4375 )	2025-03-13 03:49:05 -07:00
Lianmin Zheng	c76040e31b	Support page size > 1 (#4356 )	2025-03-12 22:22:39 -07:00
Zhiqiang Xie	10b544ae9b	Hierarchical Caching Refactoring and Fixing TP issue (#4082 )	2025-03-12 11:22:35 -07:00
Zhiqiang Xie	9376ac361d	Memory pool fix for upstream change about eagle (#4170 )	2025-03-07 00:58:20 -08:00
Zhiqiang Xie	aee30630d8	Add a pointer to the real KV cache pool (#4113 )	2025-03-05 21:39:07 -08:00
luzengxiangcn	62b362b1f1	Debug radixcache: refactor recursive helper methods (#3029 ) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-03-05 16:11:42 -08:00
Ying Sheng	d3d4d76758	[Eagle] Refactor eagle speculative decoding (#3986 ) Co-authored-by: Ke Bao <ISPObaoke@163.com>	2025-03-05 08:06:07 -08:00
Lianmin Zheng	e074d84e5b	[Minor] more code cleanup (#4077 )	2025-03-04 21:23:47 -08:00
Lianmin Zheng	ac2387279e	Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: dhou-xai <dhou@x.ai> Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>	2025-03-03 00:12:04 -08:00
Zhiqiang Xie	6c7a152c5a	Hierarchical Caching for SGLang (#2693 ) Co-authored-by: Wenxuan Tan <wenxuan.tan@wisc.edu> Co-authored-by: Yineng Zhang <me@zhyncs.com>	2025-02-23 21:56:30 -08:00
Zhiqiang Xie	08104b56de	Sanity check to prevent performance regression (#3171 ) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2025-01-27 12:28:17 -08:00
Lianmin Zheng	dc1881326f	Fix perf regression on small batch sizes (#3008 )	2025-01-20 03:39:49 -08:00
Lianmin Zheng	7906d1d298	Remove the unused write_with_records (#2972 )	2025-01-18 20:20:23 -08:00
Lianmin Zheng	46d4431889	Add a new api configure_logging to allow dumping the requests (#2875 )	2025-01-13 14:24:00 -08:00
fzyzcjy	923f518337	CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630 )	2025-01-13 11:38:51 -08:00
Lianmin Zheng	72c7776355	Fix linear.py and improve weight loading (#2851 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2025-01-13 01:39:14 -08:00
bjmsong	0bb0f76311	Support FP8 E4M3 KV Cache (#2786 ) Co-authored-by: root <bjmsong@126.com>	2025-01-12 21:17:11 -08:00
Zhiqiang Xie	51caee740f	Host memory pool for hierarchical caching (#2771 )	2025-01-07 21:38:37 +00:00
Lianmin Zheng	8496701934	[Misc] Fix metrics, weight update lock, request logging (#2543 )	2024-12-22 06:27:22 -08:00
SangBin Cho	9208618b3e	[Core] in batch prefix caching by delay scheduling (#2442 )	2024-12-11 12:51:50 -08:00
Qun Yang	37ee906f61	Add more support for intel Gaudi accelerators (#2357 )	2024-12-06 01:16:33 -08:00
Lianmin Zheng	b548801ddb	Update docs (#1839 )	2024-10-30 02:49:08 -07:00
Lianmin Zheng	fc82f5a743	[Fix] Fix cuda graph padding for triton attention backend (#1782 )	2024-10-24 12:33:15 -07:00
Lianmin Zheng	fbcbb26327	Fix perf regression for set_kv_buffer (#1765 )	2024-10-23 09:57:08 -07:00
Lianmin Zheng	ad4125d1a9	Fuse more ops & Simplify token mapping (#1758 )	2024-10-22 23:20:43 -07:00
Liangsheng Yin	94cde10920	Llama3.2 vision model support (#1551 )	2024-10-21 15:01:21 -07:00
Lianmin Zheng	b48edff67f	Split the overlapped version of TpModelWorkerClient into a separate file (#1726 )	2024-10-20 00:29:29 -07:00
Lianmin Zheng	59cbf47626	Unify the memory pool api and tp worker API (#1724 )	2024-10-19 23:19:26 -07:00
Lianmin Zheng	769bf11c05	Fix the race condition in overlap mode (#1712 )	2024-10-19 06:50:56 -07:00
Lianmin Zheng	2bcfba1b08	Skip unnecessary penalizer (#1707 )	2024-10-18 17:54:03 -07:00
Lianmin Zheng	bc12d4033f	Add grouped free operations (#1706 )	2024-10-18 13:21:05 -07:00
wxsm	b170930534	feat: radix tree code optimize (#1697 )	2024-10-17 08:01:27 -07:00

75 Commits