Commit Graph

839 Commits

Author
SHA1 Message Date
zifeitong
93dffd699b Add constrained_json_whitespace_pattern to ServerArgs (#1438) 2024-09-16 13:29:18 -07:00
Ying Sheng
2abe4f1cb6 Revert "[Minor] Raise exception for wrong import (#1409)" (#1432) 2024-09-15 15:22:32 -07:00
Ying Sheng
37963394aa [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) 2024-09-15 12:46:04 -07:00
Lianmin Zheng
899cf5c438 Remove deprecated configs (#1431) 2024-09-15 08:52:18 -07:00
Lianmin Zheng
e79f6cd73d Release v0.3.1 (#1430) 2024-09-15 23:03:16 +09:00
Lianmin Zheng
9ba1f09760 [Fix] Fix logprob and normalized_logprob (#1428) 2024-09-15 06:36:06 -07:00
Lianmin Zheng
282681b8a1 Update backend.md (#1429) 2024-09-15 02:55:34 -07:00
William Arnold
58cafe23a7 Add libibverbs-dev to Dockerfile (#1427) 2024-09-15 15:40:31 +09:00
Lianmin Zheng
9463bc1385 Enable torch.compile for triton backend (#1422) 2024-09-14 15:38:37 -07:00
Yineng Zhang
e3fc4658f4 fix: resolve nightly eval (#1426) 2024-09-15 02:07:52 +10:00
Ke Bao
33b54e7c40 Add pytorch sampling backend ut (#1425) 2024-09-15 01:15:30 +10:00
Jerry Zhang
30b404ce72 Add torchao quant for mixtral and qwen_moe (#1418) 2024-09-14 06:46:55 +00:00
Liangsheng Yin
70b6802982 Optimize conflicts between CUDA graph and vocab mask tensors (#1392) 2024-09-13 20:27:53 -07:00
Yineng Zhang
f3d32f888a ci: fix finish (#1414) 2024-09-14 01:01:30 +10:00
Lianmin Zheng
8779da95d6 Update pr-test.yml (#1412) 2024-09-13 00:37:13 -07:00
Lianmin Zheng
ad0ff62a4c Balance test in CI (#1411) 2024-09-12 23:29:44 -07:00
Ying Sheng
9a903a8784 [Minor] Raise exception for wrong import (#1409) 2024-09-12 23:02:36 -07:00
Lianmin Zheng
68be2f6d3b [CI] Include triton backend and online serving benchmark into CI (#1408) 2024-09-12 21:36:41 -07:00
Lianmin Zheng
b912de11b0 Make stop reason a dict instead of str (#1407) 2024-09-12 20:47:31 -07:00
Ying Sheng
eb02c1618a [Minor, CI] remove lora test from minimal suite (#1406) 2024-09-12 16:49:50 -07:00
Ying Sheng
712216928f [Feature] Initial support for multi-LoRA serving (#1307) 2024-09-12 16:46:14 -07:00
hxer7963
c33d82a211 Add Support for XVERSE Models (Dense and MoE) to sglang (#1397)
Co-authored-by: will he <hexin@xverse.cn>
Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-09-12 01:47:52 -07:00
Kaichen Zhang - NTU
8234e663e9 [Minor Fix] Fix llava modalities issue for single-image (#1402) 2024-09-12 01:10:26 -07:00
Zihao Ye
debbdb5178 kernel: use tensor cores for flashinfer gqa kernels (#1403) 2024-09-12 00:38:18 -07:00
Lianmin Zheng
3efa798116 Support cuda graph in the triton attention backend (#1401) 2024-09-12 00:36:55 -07:00
William
2a71be5e25 Fix README format (#1399) 2024-09-11 23:46:51 -07:00
Liangsheng Yin
4462137777 Add no commit to main rule (#1393) 2024-09-12 05:40:45 +08:00
Lianmin Zheng
fec185ce0c Refactor attention backend (#1381) 2024-09-11 11:44:26 -07:00
Lianmin Zheng
c03cece42f Improve error reporting during server launch (#1390) 2024-09-11 04:50:04 -07:00
Lianmin Zheng
15c75e4146 [Fix] Fix --disable-flashinfer (#1389) 2024-09-11 04:36:21 -07:00
Vectory
224200e3c2 BaiChuan2 Model (#1367)
Co-authored-by: wanpenghan <wanpenghan@sohu-inc.com>
2024-09-11 03:55:24 -07:00
Byron Hsu
8c0efa514d remove assertion in triton attention and add an unit test (#1385) 2024-09-11 03:22:07 -07:00
Liangsheng Yin
144bc70fcc Organize flashinfer indices update (#1378) 2024-09-10 17:38:59 -07:00
Lianmin Zheng
46094e0c1b Deprecate --disable-flashinfer and introduce --attention-backend (#1380) 2024-09-10 17:11:16 -07:00
Lianmin Zheng
3a6e8b6d78 [Minor] move triton attention kernels into a separate folder (#1379) 2024-09-10 15:15:08 -07:00
Liangsheng Yin
fbb4754cb8 Fix vocab mask update bug (#1376) 2024-09-10 13:10:36 -07:00
Lianmin Zheng
6c7cb90365 [Minor] improve kill scripts and torchao import (#1375) 2024-09-11 04:27:03 +10:00
josephrocca
dff2860a69 Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-09-11 02:35:03 +10:00
William
e72275cf7f Support MiniCPM3 (#1371) 2024-09-10 19:57:52 +10:00
wangchao
fec2d1223c [Fix] fix bug of undefined is_single in meth create_abort_task (#1370) 2024-09-10 01:17:37 -07:00
Lianmin Zheng
8d1095dbf0 [Docs] Improve documentations (#1368) 2024-09-09 20:48:28 -07:00
Chayenne
743007e1ce Adding Documentation for installation (#1300)
Co-authored-by: zhaochen20 <zhaochenyang20@gmail.com>
2024-09-09 19:09:13 -07:00
zifeitong
9144ed1067 Support OpenAI API json_schema response format (#1363) 2024-09-09 19:08:25 -07:00
Liangsheng Yin
69b3bb9ae1 Unify forward mode (#1360) 2024-09-09 13:49:29 -07:00
Ying Sheng
689ff588ec [CI] Return output logprobs in unit test (#1361) 2024-09-09 13:05:13 -07:00
Jerry Zhang
a7c47e0f02 Add torchao quant (int4/int8/fp8) to llama models (#1341)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-09-09 05:32:41 -07:00
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2024-09-09 02:14:25 -07:00
Kaichen Zhang - NTU
662ecd9368 [Feat] Add modalities for vision server when handling pixel values for llava (#1346) 2024-09-09 02:07:34 -07:00
Byron Hsu
8e6bdf851c [triton] Support head_dim not 2^n in triton extend and decode attention (#1281) 2024-09-09 01:30:24 -07:00