584 Commits

Author SHA1 Message Date
Yineng Zhang
13ac95b894 chore: bump v0.2.14.post2 (#1250) 2024-08-28 18:46:33 +00:00
Yineng Zhang
492143bf32 fix: resolve qwen2 moe weight loader (#1252) 2024-08-28 11:25:46 -07:00
Lianmin Zheng
0a97d7962d [Fix] Fix OOM in llava base class (#1249) 2024-08-28 08:45:49 -07:00
Yineng Zhang
c411f32e1c feat: replace GeluAndMul (#1234) 2024-08-28 14:07:02 +00:00
Lianmin Zheng
bf53bf5142 [Fix] Fix llava on multi images (#1247) 2024-08-28 06:33:05 -07:00
Yineng Zhang
b1a540ec42 feat: update GemmaRMSNorm (#1232) 2024-08-28 22:47:34 +10:00
Yineng Zhang
66975360e7 fix: increase max_new_tokens when testing generation models (#1244) 2024-08-28 22:12:36 +10:00
Yineng Zhang
f25f4dfde5 hotfix: revert sampler CUDA Graph (#1242) 2024-08-28 21:16:47 +10:00
Yineng Zhang
198974cd1a feat: support sm75 with FlashInfer v0.1.6 (#1233) 2024-08-28 18:39:12 +10:00
Lianmin Zheng
6cc38b2bf3 [Minor] Add more type annotations (#1237) 2024-08-28 00:54:26 -07:00
Liangsheng Yin
1ece2cda3d Fix bench latency benchmark (#1225) 2024-08-28 00:37:32 -07:00
Yineng Zhang
3602692c7c feat: replace get_act_fn for gpt_bigcode (#1231) 2024-08-27 21:15:31 +10:00
havetc
909f34363b [FIX] Wrong logger (#1230) 2024-08-27 20:10:46 +10:00
caiyueliang
2f1d92834f [FEAT] Support batches cancel (#1222)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-26 23:28:26 +00:00
havetc
9935f97b3e [FEAT] JSON constrained support (#1125)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-26 09:37:26 -07:00
Yineng Zhang
c5fe11a8e1 chore: bump v0.2.14 (#1155) 2024-08-27 00:28:24 +10:00
Liangsheng Yin
75ce37f401 Move sampler into CUDA graph (#1201)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-26 07:02:50 -07:00
Mingyi
97589a60a2 [CI] Parallelize unit tests in CI (#1219) 2024-08-26 04:54:02 +00:00
Liangsheng Yin
632d506d0b minor: improve CI and dependencies (#1212) 2024-08-26 04:26:31 +00:00
Kaichen Zhang - NTU
3579162ab1 [Fix] Multi-images loading error (#1218) 2024-08-26 03:58:51 +00:00
Mingyi
7514b9f8d3 [CI] Fix CI (#1217) 2024-08-26 02:56:42 +00:00
Mingyi
158e8f1e2d improve the threshold and ports in tests (#1215) 2024-08-25 19:02:08 -07:00
Ke Bao
2c615d120f [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-25 17:38:11 -07:00
Lianmin Zheng
15f1a49d2d Update CI workflows (#1210) 2024-08-25 16:43:07 -07:00
Ying Sheng
308d024092 [CI] Fix the issue of unit test hanging (#1211) 2024-08-25 16:21:37 -07:00
Lianmin Zheng
902278008a [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) 2024-08-25 14:46:34 -07:00
Chayenne
30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-25 10:29:12 -07:00
Kaichen Zhang - NTU
66e7dcaf70 [Fix] Fixing the multi-images error for llava-onevision (#1205) 2024-08-25 10:28:23 -07:00
Ying Sheng
1cb4da5c5f [Fix] the issue of random order when input is a list (#1199) 2024-08-24 21:43:03 -07:00
Ying Sheng
e61d13acdf [CI] Fix the problem of hf runner too slow (#1202) 2024-08-24 18:35:55 -07:00
Lianmin Zheng
f6af3a6561 Cleanup readme, llava examples, usage examples and nccl init (#1194) 2024-08-24 08:02:23 -07:00
Yineng Zhang
c9064e6fd9 feat: use gelu_tanh_and_mul (#1193) 2024-08-24 01:58:16 -07:00
Kaichen Zhang - NTU
a5b14ad043 [Feat/WIP] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. (#1123)
Co-authored-by: Bo Li <drluodian@gmail.com>
2024-08-23 14:11:16 -07:00
Ying Sheng
5fafcac008 Fix benchmark script (#1185) 2024-08-22 09:03:25 +00:00
Liangsheng Yin
364d3d72a7 Fix broken penalty (#1184) 2024-08-22 08:16:35 +00:00
Lianmin Zheng
5623826f73 [Minor] Improve logging and rename the health check endpoint name (#1180) 2024-08-21 19:24:36 -07:00
Liangsheng Yin
83e23c69b3 Improve code style of sampler (#1168) 2024-08-21 16:48:24 -07:00
intervitens
068e9eae55 Support min-p sampling (#1167) 2024-08-21 22:49:32 +00:00
rainred
d6aeb9fa15 [Feature] Add a function to convert sampling_params to kwargs (#1170)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-21 14:28:35 -07:00
Yineng Zhang
1fb9459908 fix: custom op fallback forward native when lower sm80 (#1177) 2024-08-21 14:26:35 -07:00
Lianmin Zheng
bea2bb9eea Improve multi-node stability (#1171) 2024-08-20 22:35:05 -07:00
Shan Yu
cd10654e7e [Feat] Support update weights without restart server (#1157) 2024-08-20 13:48:24 -07:00
Lucien
6242c399ab Generate 1 token to verify the health of the inference service in /health (#1154)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-21 03:14:34 +10:00
Yineng Zhang
04707b09b7 misc: add hypervisor vendor (#1165) 2024-08-21 02:14:51 +10:00
Xu-Chen
ff2cfdb1a2 [Feature] add disable-custom-all-reduce (#1148)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-20 08:44:12 -07:00
Lianmin Zheng
a8ae640328 Improve docs and warnings (#1164) 2024-08-20 08:31:29 -07:00
Juwan Yoo
d8476818ef feat: allow streaming for multi-prompt and/or parallel sampling (#1134) 2024-08-20 08:06:55 -07:00
Ke Bao
df191254ab Optimize MLA/GQA/MQA Triton decoding (#1138)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-19 20:23:07 +10:00
yichuan~
b997a18d74 [Feat]Add support for optional start len of logprobs (#1035)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2024-08-18 23:45:41 -07:00
min-xu-et
fa13b95d6b fixed a typo (#1143) 2024-08-18 14:29:09 -07:00