206 Commits

Author SHA1 Message Date
Lianmin Zheng
02f7f3e488 Update the transformers version in CI (#1690) 2024-10-16 19:03:55 -07:00
Zeng Zhongchao
2782132be8 Add date to logging messages (#1623) (#1679) 2024-10-16 18:54:55 -07:00
Michael Feil
b0facb3316 add orjson for jsonresponse (#1688) 2024-10-16 18:14:30 -07:00
Lianmin Zheng
dbec2f1847 Launch a thread to overlap CPU and GPU (#1687) 2024-10-16 11:20:17 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Patrick Yi
31fad29ab0 Add get_tokenizer function for Engine class (#1653) 2024-10-12 19:39:35 -07:00
Byron Hsu
862cd265e5 [engine] support async and streaming (#1614) 2024-10-11 15:26:25 -07:00
Lianmin Zheng
23cc66f7b6 Add back data parallelism (#1635) 2024-10-11 07:22:48 -07:00
科英
bbd72bfc86 Add the ability to enable and disable the Profiler via HTTP API. (#1626) 2024-10-11 02:34:25 -07:00
Byron Hsu
e8613df071 [Engine] Fix generate hanging issue after the first call (#1606) 2024-10-08 04:26:56 +00:00
Byron Hsu
565b05f02f Use atexit hook to implicitly shutdown Runtime (#1595) 2024-10-07 05:18:45 +00:00
Byron Hsu
551a3a9d38 Provide an offline engine API (#1567) 2024-10-06 20:27:03 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
Lianmin Zheng
63ba2f8d7b Clean up batch data structures: Introducing ModelWorkerBatch (#1544) 2024-09-30 06:41:49 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Lianmin Zheng
4e4459b91f Multiple minor fixes (#1530) 2024-09-28 14:43:35 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
HAI
3a6e04185b [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) 2024-09-17 07:43:52 +00:00
Lianmin Zheng
27b557aea7 Clean up model loader (#1440) 2024-09-16 18:16:27 -07:00
Ying Sheng
712216928f [Feature] Initial support for multi-LoRA serving (#1307) 2024-09-12 16:46:14 -07:00
Lianmin Zheng
fec185ce0c Refactor attention backend (#1381) 2024-09-11 11:44:26 -07:00
Lianmin Zheng
c03cece42f Improve error reporting during server launch (#1390) 2024-09-11 04:50:04 -07:00
Lianmin Zheng
46094e0c1b Deprecate --disable-flashinfer and introduce --attention-backend (#1380) 2024-09-10 17:11:16 -07:00
josephrocca
dff2860a69 Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-09-11 02:35:03 +10:00
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
0836055324 [Chore] Rename model_overide_args to model_override_args (#1284)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-09-01 03:14:56 -07:00
Lianmin Zheng
0a97d7962d [Fix] Fix OOM in llava base class (#1249) 2024-08-28 08:45:49 -07:00
Lianmin Zheng
bf53bf5142 [Fix] Fix llava on multi images (#1247) 2024-08-28 06:33:05 -07:00
Yineng Zhang
198974cd1a feat: support sm75 with FlashInfer v0.1.6 (#1233) 2024-08-28 18:39:12 +10:00
caiyueliang
2f1d92834f [FEAT] Support batches cancel (#1222)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-26 23:28:26 +00:00
Lianmin Zheng
902278008a [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) 2024-08-25 14:46:34 -07:00
Chayenne
30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-25 10:29:12 -07:00
Ying Sheng
1cb4da5c5f [Fix] the issue of random order when input is a list (#1199) 2024-08-24 21:43:03 -07:00
Lianmin Zheng
5623826f73 [Minor] Improve logging and rename the health check endpoint name (#1180) 2024-08-21 19:24:36 -07:00
Lianmin Zheng
bea2bb9eea Improve multi-node stability (#1171) 2024-08-20 22:35:05 -07:00
Shan Yu
cd10654e7e [Feat] Support update weights without restart server (#1157) 2024-08-20 13:48:24 -07:00
Lucien
6242c399ab Generate 1 token to verify the health of the inference service in /health (#1154)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-21 03:14:34 +10:00
yichuan~
b997a18d74 [Feat]Add support for optional start len of logprobs (#1035)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2024-08-18 23:45:41 -07:00
Lianmin Zheng
cdc8d60752 Improve the code style: more comments and remove useless packages (#1139) 2024-08-17 14:37:52 -07:00
Liangsheng Yin
3694f8f996 Mixed style of chunked prefill (#1013) 2024-08-16 09:13:00 +00:00
Lianmin Zheng
0cb099e20a set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) 2024-08-16 03:47:39 +10:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Ying Sheng
6767e2229f Support jinja as chat template file (#1104) 2024-08-14 17:43:14 -07:00
rainred
616b59f384 [Feature] modify Runtime to support skip_tokenizer_init (#1088)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-14 00:28:04 -07:00
Lianmin Zheng
c877292cc1 Re-organize CI tests (#1052) 2024-08-12 03:39:01 -07:00
Yineng Zhang
94752ac811 feat: use FlashInfer rmsnorm and silu (#907) 2024-08-11 14:57:13 +10:00
gryffindor-rr
9cf0a5bada Add skip_tokenizer_init args. (#959)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-09 12:14:13 -07:00
Ying Sheng
b16e856f11 Add openai embedding API (#997) 2024-08-09 11:19:18 -07:00
liuyhwangyh
b91a4cb1b1 support models from www.modelscope.cn (#994)
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
2024-08-09 02:52:14 -07:00