75 Commits

Author SHA1 Message Date
ybyang
e9fc2ac7b6 [PD Bug] fix MLA get_contiguous_buf_infos error (#5384) 2025-04-14 22:56:39 +08:00
huangtingwei
5fbafbb8f8 fix MLATokenToKVPoolHost get_size_per_token bug (#5161)
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
2025-04-13 12:37:26 -07:00
Teng Ma
7e4f72dd8c [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (#5204) 2025-04-10 20:05:34 +08:00
Zhiqiang Xie
3fadc64793 bug fix for hicache host eviction (#4989) 2025-04-02 00:33:50 -07:00
Zhiqiang Xie
e119f04215 Large page size aligned hierarchical caching (#4581) 2025-04-01 22:38:15 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
c1lovez1
93cf7fc5cd Unify variable naming: replace is_in_free_group with is_not_in_free_group (#4698) 2025-03-23 21:51:08 -07:00
Zhiqiang Xie
4d25305700 Move mem_state update into debug mode (#4525) 2025-03-23 00:52:27 -07:00
Byron Hsu
c7c7dbebbe [PD] Release initial code (#4654)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Ying1123 <sqy1415@gmail.com>
Co-authored-by: merrymercy <lianminzheng@gmail.com>
Co-authored-by: makro
Co-authored-by: dhou-xai
2025-03-21 14:47:47 -07:00
Zhiqiang Xie
a98290aea3 Unit test for Hierarchical Caching (#4486) 2025-03-17 17:45:00 -07:00
Zhiqiang Xie
f5bbf6037d Fix: Complete int32 to int64 conversion (#4465) 2025-03-16 18:14:27 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Lianmin Zheng
2c4f5ccac1 Fix minor style (#4460) 2025-03-15 21:51:12 -07:00
Wang Ran (汪然)
158430473e Fix typos (#4368) 2025-03-15 21:27:58 -07:00
Chen Shengzhi
86d9baedc2 [Fix] Fix errors when using the device except cuda. (#4455) 2025-03-15 16:33:00 -07:00
Lu Changqi
0e0ec70200 Hierarchical Caching supports MLA (#4009)
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-03-13 20:42:14 -07:00
Zhiqiang Xie
fbdb50501f Hot fix for hicache with new page aligned radixtree (#4397) 2025-03-13 15:50:49 -07:00
Lianmin Zheng
a5a892ffd3 Fix auto merge & add back get_flat_data_by_layer (#4393) 2025-03-13 08:46:25 -07:00
Lianmin Zheng
4fea040ca1 Fix a regression introduced by overlapping KV cache writing (#4375) 2025-03-13 03:49:05 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Zhiqiang Xie
9376ac361d Memory pool fix for upstream change about eagle (#4170) 2025-03-07 00:58:20 -08:00
Zhiqiang Xie
aee30630d8 Add a pointer to the real KV cache pool (#4113) 2025-03-05 21:39:07 -08:00
luzengxiangcn
62b362b1f1 Debug radixcache: refactor recursive helper methods (#3029)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-03-05 16:11:42 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Lianmin Zheng
e074d84e5b [Minor] more code cleanup (#4077) 2025-03-04 21:23:47 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Zhiqiang Xie
6c7a152c5a Hierarchical Caching for SGLang (#2693)
Co-authored-by: Wenxuan Tan <wenxuan.tan@wisc.edu>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-02-23 21:56:30 -08:00
Zhiqiang Xie
08104b56de Sanity check to prevent performance regression (#3171)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-01-27 12:28:17 -08:00
Lianmin Zheng
dc1881326f Fix perf regression on small batch sizes (#3008) 2025-01-20 03:39:49 -08:00
Lianmin Zheng
7906d1d298 Remove the unused write_with_records (#2972) 2025-01-18 20:20:23 -08:00
Lianmin Zheng
46d4431889 Add a new api configure_logging to allow dumping the requests (#2875) 2025-01-13 14:24:00 -08:00
fzyzcjy
923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) 2025-01-13 11:38:51 -08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
bjmsong
0bb0f76311 Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
2025-01-12 21:17:11 -08:00
Zhiqiang Xie
51caee740f Host memory pool for hierarchical caching (#2771) 2025-01-07 21:38:37 +00:00
Lianmin Zheng
8496701934 [Misc] Fix metrics, weight update lock, request logging (#2543) 2024-12-22 06:27:22 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Lianmin Zheng
b548801ddb Update docs (#1839) 2024-10-30 02:49:08 -07:00
Lianmin Zheng
fc82f5a743 [Fix] Fix cuda graph padding for triton attention backend (#1782) 2024-10-24 12:33:15 -07:00
Lianmin Zheng
fbcbb26327 Fix perf regression for set_kv_buffer (#1765) 2024-10-23 09:57:08 -07:00
Lianmin Zheng
ad4125d1a9 Fuse more ops & Simplify token mapping (#1758) 2024-10-22 23:20:43 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
Lianmin Zheng
b48edff67f Split the overlapped version of TpModelWorkerClient into a separate file (#1726) 2024-10-20 00:29:29 -07:00
Lianmin Zheng
59cbf47626 Unify the memory pool api and tp worker API (#1724) 2024-10-19 23:19:26 -07:00
Lianmin Zheng
769bf11c05 Fix the race condition in overlap mode (#1712) 2024-10-19 06:50:56 -07:00
Lianmin Zheng
2bcfba1b08 Skip unnecessary penalizer (#1707) 2024-10-18 17:54:03 -07:00
Lianmin Zheng
bc12d4033f Add grouped free operations (#1706) 2024-10-18 13:21:05 -07:00
wxsm
b170930534 feat: radix tree code optimize (#1697) 2024-10-17 08:01:27 -07:00