Commit Graph

47 Commits

Author SHA1 Message Date
Ying Sheng
2fce449b1c [API] add get memory pool size (#1760) 2024-10-23 07:02:29 +00:00
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Byron Hsu
17536e7e3d Fix edge case for truncated (#1747) 2024-10-23 00:00:25 -04:00
Liangsheng Yin
5e1558f1f2 Update max_req_len and max_req_input_len (#1748) 2024-10-21 16:12:04 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
Liangsheng Yin
efb099cdee Fix prefill oom (#1743) 2024-10-21 03:54:35 -07:00
Lianmin Zheng
b121bc03a3 Simplify batch result resolution (#1735) 2024-10-20 19:47:14 -07:00
Lianmin Zheng
e12358dc91 Simplify the usage of device (#1734) 2024-10-20 18:17:41 -07:00
Lianmin Zheng
b48edff67f Split the overlapped version of TpModelWorkerClient into a separate file (#1726) 2024-10-20 00:29:29 -07:00
Lianmin Zheng
59cbf47626 Unify the memory pool api and tp worker API (#1724) 2024-10-19 23:19:26 -07:00
Lianmin Zheng
12cad0feae Simplify the interface of tp_worker (#1718) 2024-10-19 17:39:38 -07:00
Lianmin Zheng
769bf11c05 Fix the race condition in overlap mode (#1712) 2024-10-19 06:50:56 -07:00
Lianmin Zheng
2bcfba1b08 Skip unnecessary penalizer (#1707) 2024-10-18 17:54:03 -07:00
Lianmin Zheng
bc12d4033f Add grouped free operations (#1706) 2024-10-18 13:21:05 -07:00
Liangsheng Yin
9e0dac1ad7 Fix regex and logprob conflicts when chunked prefilling (#1703) 2024-10-17 18:33:21 -07:00
havetc
ecb8bad276 Returning a per request metric for number of cached_tokens read (#1599) 2024-10-16 11:49:22 -07:00
Lianmin Zheng
dbec2f1847 Launch a thread to overlap CPU and GPU (#1687) 2024-10-16 11:20:17 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Lianmin Zheng
f1088e0fc8 Fix memory leak during abort (#1674) 2024-10-15 08:15:08 -07:00
Lianmin Zheng
4a292f670d [Minor] Add some utility functions (#1671) 2024-10-14 20:08:03 -07:00
Lianmin Zheng
02bc95796d Simplify chunked prefill (#1667) 2024-10-14 06:47:50 -07:00
Lianmin Zheng
24f3e1511c [Minor] Improve style (#1666) 2024-10-14 05:25:00 -07:00
Lianmin Zheng
0c1e87964b Move filter_batch out of stream_output (#1663) 2024-10-14 01:15:34 -07:00
Lianmin Zheng
869f1c02c4 Add a test case to test retract (#1662) 2024-10-13 20:32:37 -07:00
Ying Sheng
2725f8da61 [Minor] Rename no_eos_trim to no_stop_trim (#1661) 2024-10-13 20:30:03 -07:00
Lianmin Zheng
da1ffed689 Add output_ids into ScheduleBatch (#1659) 2024-10-13 19:54:02 -07:00
Ying Sheng
4876117171 [Fix] fix eos trim inconsistency (#1650) 2024-10-13 01:07:09 -07:00
Lianmin Zheng
7ee6c259ff Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652) 2024-10-12 21:35:30 -07:00
Lianmin Zheng
9610fcd469 Fix the batch_is_full check for jump-forward decoding (#1654) 2024-10-12 19:47:24 -07:00
Lianmin Zheng
9da5a60b18 Add an option to disable penalizer (#1651) 2024-10-12 17:53:23 -07:00
Lianmin Zheng
69aa937aa5 Fix unit tests and type annotations (#1648) 2024-10-12 14:49:24 -07:00
Lianmin Zheng
23cc66f7b6 Add back data parallelism (#1635) 2024-10-11 07:22:48 -07:00
科英
bbd72bfc86 Add the ability to enable and disable the Profiler via HTTP API. (#1626) 2024-10-11 02:34:25 -07:00
Byron Hsu
01fdb2f377 Fix test_vision_openai_server on CI (#1620) 2024-10-10 16:34:13 -07:00
Ying Sheng
c5325aba75 [Profile] Add pytorch profiler (#1604) 2024-10-07 14:37:16 -07:00
Lianmin Zheng
ebbc42d989 Optimize broadcast & Reorg code (#1598) 2024-10-07 13:19:23 -07:00
Lianmin Zheng
b6aad70ab1 [Fix] Fix the case where prompt_len = 0 (#1593) 2024-10-06 20:30:02 -07:00
Lianmin Zheng
58d1082e39 Clean up event loop (#1586) 2024-10-06 03:24:04 -07:00
Lianmin Zheng
9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) 2024-10-06 00:10:48 -07:00
Liangsheng Yin
5d0ba4038f Refine the add request reasons to avoid corner cases. (#1574) 2024-10-04 18:00:18 -07:00
Ying Sheng
04b262cd91 [Fix] Fix major performance bug in certain cases (#1563) 2024-10-04 08:51:11 +00:00
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
Lianmin Zheng
63ba2f8d7b Clean up batch data structures: Introducing ModelWorkerBatch (#1544) 2024-09-30 06:41:49 -07:00
Lianmin Zheng
36d5acfca5 Rename InputMetadata -> ForwardBatch (#1543) 2024-09-30 02:41:11 -07:00
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00