Commit Graph

540 Commits

Author SHA1 Message Date
Xu-Chen
ff2cfdb1a2 [Feature] add disable-custom-all-reduce (#1148)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-20 08:44:12 -07:00
Lianmin Zheng
a8ae640328 Improve docs and warnings (#1164) 2024-08-20 08:31:29 -07:00
Juwan Yoo
d8476818ef feat: allow streaming for multi-prompt and/or parallel sampling (#1134) 2024-08-20 08:06:55 -07:00
Ke Bao
df191254ab Optimize MLA/GQA/MQA Triton decoding (#1138)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-19 20:23:07 +10:00
yichuan~
b997a18d74 [Feat]Add support for optional start len of logprobs (#1035)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2024-08-18 23:45:41 -07:00
min-xu-et
fa13b95d6b fixed a typo (#1143) 2024-08-18 14:29:09 -07:00
Lianmin Zheng
3c1f5a9220 Fix duplicated imports in hf_transformers_utils.py (#1141) 2024-08-17 18:03:00 -07:00
Lianmin Zheng
57d0bd91ec Improve benchmark (#1140) 2024-08-17 17:43:23 -07:00
Lianmin Zheng
cdc8d60752 Improve the code style: more comments and remove useless packages (#1139) 2024-08-17 14:37:52 -07:00
Yineng Zhang
9208591f05 fix: use fp16 dtype for sm75 (#1136) 2024-08-18 00:45:42 +10:00
Liangsheng Yin
f624f6a6cc Fix port conflicts between local CI and runner CI. (#1131) 2024-08-16 15:12:38 -07:00
Liangsheng Yin
3694f8f996 Mixed style of chunked prefill (#1013) 2024-08-16 09:13:00 +00:00
Lianmin Zheng
5a261bd055 Fix the deadlock in multi-node tp (#1122) 2024-08-16 01:39:24 -07:00
Yineng Zhang
5bd953749b chore: bump v0.2.13 (#1111) 2024-08-16 03:50:43 +10:00
Lianmin Zheng
0cb099e20a set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) 2024-08-16 03:47:39 +10:00
Ying Sheng
93d4e354d8 [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) 2024-08-15 10:33:20 -07:00
Yineng Zhang
9195d1362a misc: rm unused model_loader (#1110) 2024-08-15 08:29:35 -07:00
Ying Sheng
14cb544d56 [Fix] fix flashinfer usage for window attention (#1107) 2024-08-15 00:53:24 -07:00
Lianmin Zheng
e86b1ccbf0 Enable chunked prefill by default (#1040) 2024-08-14 21:56:20 -07:00
Ying Sheng
8d2d876fc8 [Fix] fix the typo bug for window attention (#1106) 2024-08-14 21:56:01 -07:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Ying Sheng
6767e2229f Support jinja as chat template file (#1104) 2024-08-14 17:43:14 -07:00
Liangsheng Yin
73cf6834f2 Support stop_token_ids in sglang API (#1092) 2024-08-15 00:31:39 +00:00
Ying Sheng
96a2093ef0 [Fix] Compatibility of window attention and cuda graph (#1090) 2024-08-14 10:37:01 -07:00
Liangsheng Yin
a34dd86a7d Use dtype to control generate (#1082)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-08-14 15:58:07 +00:00
Lianmin Zheng
a59636bb5e Update grok 1 model (#1095) 2024-08-14 04:40:44 -07:00
Lianmin Zheng
8f790ac100 Fix a bug in cuda graph runner (#1094) 2024-08-14 03:25:38 -07:00
rainred
616b59f384 [Feature] modify Runtime to support skip_tokenizer_init (#1088)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-14 00:28:04 -07:00
Liangsheng Yin
e205527cb1 Fix jump forward final state circular path bug. (#1084) 2024-08-13 21:14:05 -07:00
Ying Sheng
0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) 2024-08-13 17:01:26 -07:00
Lianmin Zheng
ad3e4f1619 Update the mixtral to use the better FusedMoE layer (#1081) 2024-08-13 15:44:25 -07:00
rainred
95f5fbf1a7 Fix create_abort_task, GenerateReqInput does not have rids. (#1079)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-13 12:47:22 +00:00
Yineng Zhang
f7fb68d292 ci: add moe test (#1053) 2024-08-13 18:43:23 +10:00
Yineng Zhang
65915f9f3e fix: temporary solution for DeepSeek V2 H100 layout conversion issue (#1060)
Co-authored-by: ispobock <ISPObaoke@163.com>
2024-08-13 15:48:54 +10:00
Ke Bao
162f3ccb01 Fix layernorm input shape (#1066)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-13 15:48:07 +10:00
Yineng Zhang
65e89baea9 fix: not use the default port (#1068) 2024-08-13 15:12:56 +10:00
Yineng Zhang
6a38efa834 feat: replace all rmsnorm and silu (#1057) 2024-08-13 02:15:59 +10:00
Yineng Zhang
b0ad0c1bc8 chore: bump v0.2.12 (#1048) 2024-08-12 20:59:38 +10:00
Lianmin Zheng
c877292cc1 Re-organize CI tests (#1052) 2024-08-12 03:39:01 -07:00
Lianmin Zheng
0c1c72a0b4 Fix accuracy test (#1051) 2024-08-12 19:48:40 +10:00
Lianmin Zheng
41598e0d8e Add longer accuracy test on CI (#1049) 2024-08-12 09:21:38 +00:00
Ying Sheng
32f6144323 fix: Fix returned prefill logits and add output str test (#1046) 2024-08-12 06:13:45 +00:00
Lianmin Zheng
fb1f28cbbb Clean up the comments and names under python/sglang/srt/layers (#1047) 2024-08-12 05:54:37 +00:00
Liangsheng Yin
fb7421db0d minor: some potential bugs (#1044) 2024-08-12 05:35:44 +00:00
Lianmin Zheng
8207637029 Improve end-to-end throughput test and its coverage (#1039) 2024-08-11 18:27:33 -07:00
Liangsheng Yin
7de6034534 Fix the prefix indices (#1037) 2024-08-11 17:57:02 -07:00
Lianmin Zheng
d84c5e70f7 Test the case when max_new_tokens is very large (#1038) 2024-08-11 16:41:03 -07:00
Lianmin Zheng
d785412077 Fix the case when max_new_tokens is too large (#1025) 2024-08-11 15:20:18 -07:00
Liangsheng Yin
7b6a5332ca Fix triton args init (#1034) 2024-08-11 12:11:26 -07:00
Lianmin Zheng
4080e82244 Fix the case where r.prefix_indices is None (#1031) 2024-08-11 04:53:51 -07:00