39 Commits

Author SHA1 Message Date
Lianmin Zheng
8496701934 [Misc] Fix metrics, weight update lock, request logging (#2543) 2024-12-22 06:27:22 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Lianmin Zheng
b548801ddb Update docs (#1839) 2024-10-30 02:49:08 -07:00
Lianmin Zheng
fc82f5a743 [Fix] Fix cuda graph padding for triton attention backend (#1782) 2024-10-24 12:33:15 -07:00
Lianmin Zheng
fbcbb26327 Fix perf regression for set_kv_buffer (#1765) 2024-10-23 09:57:08 -07:00
Lianmin Zheng
ad4125d1a9 Fuse more ops & Simplify token mapping (#1758) 2024-10-22 23:20:43 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
Lianmin Zheng
b48edff67f Split the overlapped version of TpModelWorkerClient into a separate file (#1726) 2024-10-20 00:29:29 -07:00
Lianmin Zheng
59cbf47626 Unify the memory pool api and tp worker API (#1724) 2024-10-19 23:19:26 -07:00
Lianmin Zheng
769bf11c05 Fix the race condition in overlap mode (#1712) 2024-10-19 06:50:56 -07:00
Lianmin Zheng
2bcfba1b08 Skip unnecessary penalizer (#1707) 2024-10-18 17:54:03 -07:00
Lianmin Zheng
bc12d4033f Add grouped free operations (#1706) 2024-10-18 13:21:05 -07:00
wxsm
b170930534 feat: radix tree code optimize (#1697) 2024-10-17 08:01:27 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Shuo Yang
061e546313 Support double sparsity (#1459) 2024-10-14 02:00:41 -07:00
Lianmin Zheng
9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) 2024-10-06 00:10:48 -07:00
Lianmin Zheng
45473d4b2b Make input_ids a torch.Tensor (#1568) 2024-10-04 01:09:59 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
Lianmin Zheng
4ae0969c0a Move status check in the memory pool to CPU (#1557) 2024-10-02 18:23:35 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
luzengxiangcn
e6692bf4a5 debug radixcache stack_overflow (#1499) 2024-09-24 04:58:01 -07:00
Ke Bao
2c615d120f [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) 2024-08-25 17:38:11 -07:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
c877292cc1 Re-organize CI tests (#1052) 2024-08-12 03:39:01 -07:00
Liangsheng Yin
fb7421db0d minor: some potential bugs (#1044) 2024-08-12 05:35:44 +00:00
Liangsheng Yin
7de6034534 Fix the prefix indices (#1037) 2024-08-11 17:57:02 -07:00
Lianmin Zheng
9dae407812 Improve type annotation (#1029) 2024-08-11 02:44:59 -07:00
Liangsheng Yin
fcc0f5ed99 Fix wrong assert (#1028) 2024-08-11 09:22:16 +00:00
Liangsheng Yin
43fbb6d919 Fix input_ids && rename to fill_ids (#1021) 2024-08-10 16:24:12 -07:00
Liangsheng Yin
62757db6f0 Reduce the overhead when cache is disabled (#1010) 2024-08-09 16:36:57 -07:00
Liangsheng Yin
6ed4e3b8fb Fix chunked prefill (#984) 2024-08-07 22:28:42 -07:00
Liangsheng Yin
7623091d97 RadixCache method adjust (#977) 2024-08-07 15:52:24 -07:00
Zhiqiang Xie
6db27f7b3b misc: correct the int data type for token ids and indices (#969) 2024-08-08 04:40:07 +08:00
Liangsheng Yin
a01ddd9605 misc: fix the req_to_token member change (#967) 2024-08-07 01:52:10 -07:00
Liangsheng Yin
7fa54a1ab3 Make req_pool_indices on CPU (#960) 2024-08-07 01:41:25 -07:00
Ke Bao
e1eae1fd15 Support MLA for DeepSeek-V2 with Triton - step 1 (#905) 2024-08-05 03:40:33 +10:00
Liangsheng Yin
c020f9ceda Support chunked prefill when radix cache is disabled (#811) 2024-08-01 00:29:01 -07:00
Liangsheng Yin
cdcbde5fc3 Code structure refactor (#807) 2024-07-29 23:04:48 -07:00