Commit Graph

166 Commits

Author SHA1 Message Date
Lianmin Zheng
b548801ddb Update docs (#1839) 2024-10-30 02:49:08 -07:00
Byron Hsu
680cad2023 fix get_memory_pool_size deadlock for DP (#1830) 2024-10-28 23:07:14 -07:00
Byron Hsu
6fcd6d7d6d Support token ids in engine.generate (#1820) 2024-10-27 14:02:34 -07:00
Lianmin Zheng
eaade87a42 Fix unit tests (#1817) 2024-10-27 03:04:54 -07:00
Lianmin Zheng
86fc0d79d0 Add a watch dog thread (#1816) 2024-10-27 02:00:50 -07:00
Ying Sheng
2fce449b1c [API] add get memory pool size (#1760) 2024-10-23 07:02:29 +00:00
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Lianmin Zheng
769bf11c05 Fix the race condition in overlap mode (#1712) 2024-10-19 06:50:56 -07:00
Lianmin Zheng
dd3809fad8 Fix engine unit test (#1701) 2024-10-17 09:53:32 -07:00
Lianmin Zheng
7feba41584 Fix failed ci tests on long prompts; Better error messages for embedding models (#1700) 2024-10-17 09:23:29 -07:00
Michael Feil
e5db40dcbc ORJson. Faster Json serialization (#1694) 2024-10-17 08:03:08 -07:00
Lianmin Zheng
02f7f3e488 Update the transformers version in CI (#1690) 2024-10-16 19:03:55 -07:00
Zeng Zhongchao
2782132be8 Add date to logging messages (#1623) (#1679) 2024-10-16 18:54:55 -07:00
Michael Feil
b0facb3316 add orjson for jsonresponse (#1688) 2024-10-16 18:14:30 -07:00
Lianmin Zheng
dbec2f1847 Launch a thread to overlap CPU and GPU (#1687) 2024-10-16 11:20:17 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Patrick Yi
31fad29ab0 Add get_tokenizer function for Engine class (#1653) 2024-10-12 19:39:35 -07:00
Byron Hsu
862cd265e5 [engine] support async and streaming (#1614) 2024-10-11 15:26:25 -07:00
Lianmin Zheng
23cc66f7b6 Add back data parallelism (#1635) 2024-10-11 07:22:48 -07:00
科英
bbd72bfc86 Add the ability to enable and disable the Profiler via HTTP API. (#1626) 2024-10-11 02:34:25 -07:00
Byron Hsu
e8613df071 [Engine] Fix generate hanging issue after the first call (#1606) 2024-10-08 04:26:56 +00:00
Byron Hsu
565b05f02f Use atexit hook to implicitly shutdown Runtime (#1595) 2024-10-07 05:18:45 +00:00
Byron Hsu
551a3a9d38 Provide an offline engine API (#1567) 2024-10-06 20:27:03 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
Lianmin Zheng
63ba2f8d7b Clean up batch data structures: Introducing ModelWorkerBatch (#1544) 2024-09-30 06:41:49 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Lianmin Zheng
4e4459b91f Multiple minor fixes (#1530) 2024-09-28 14:43:35 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
HAI
3a6e04185b [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) 2024-09-17 07:43:52 +00:00
Lianmin Zheng
27b557aea7 Clean up model loader (#1440) 2024-09-16 18:16:27 -07:00
Ying Sheng
712216928f [Feature] Initial support for multi-LoRA serving (#1307) 2024-09-12 16:46:14 -07:00
Lianmin Zheng
fec185ce0c Refactor attention backend (#1381) 2024-09-11 11:44:26 -07:00
Lianmin Zheng
c03cece42f Improve error reporting during server launch (#1390) 2024-09-11 04:50:04 -07:00
Lianmin Zheng
46094e0c1b Deprecate --disable-flashinfer and introduce --attention-backend (#1380) 2024-09-10 17:11:16 -07:00
josephrocca
dff2860a69 Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373) 2024-09-11 02:35:03 +10:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
0836055324 [Chore] Rename model_overide_args to model_override_args (#1284) 2024-09-01 03:14:56 -07:00
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
0a97d7962d [Fix] Fix OOM in llava base class (#1249) 2024-08-28 08:45:49 -07:00
Lianmin Zheng
bf53bf5142 [Fix] Fix llava on multi images (#1247) 2024-08-28 06:33:05 -07:00
Yineng Zhang
198974cd1a feat: support sm75 with FlashInfer v0.1.6 (#1233) 2024-08-28 18:39:12 +10:00
caiyueliang
2f1d92834f [FEAT] Support batches cancel (#1222) 2024-08-26 23:28:26 +00:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
902278008a [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) 2024-08-25 14:46:34 -07:00
Chayenne
30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) 2024-08-25 10:29:12 -07:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Ying Sheng
1cb4da5c5f [Fix] the issue of random order when input is a list (#1199) 2024-08-24 21:43:03 -07:00
Lianmin Zheng
5623826f73 [Minor] Improve logging and rename the health check endpoint name (#1180) 2024-08-21 19:24:36 -07:00
Lianmin Zheng
bea2bb9eea Improve multi-node stability (#1171) 2024-08-20 22:35:05 -07:00
Shan Yu
cd10654e7e [Feat] Support update weights without restart server (#1157) 2024-08-20 13:48:24 -07:00
Lucien
6242c399ab Generate 1 token to verify the health of the inference service in /health (#1154) 2024-08-21 03:14:34 +10:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
yichuan~
b997a18d74 [Feat]Add support for optional start len of logprobs (#1035) 2024-08-18 23:45:41 -07:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Lianmin Zheng
cdc8d60752 Improve the code style: more comments and remove useless packages (#1139) 2024-08-17 14:37:52 -07:00