Ke Bao | df191254ab | Optimize MLA/GQA/MQA Triton decoding (#1138); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-08-19 20:23:07 +10:00 |
yichuan~ | b997a18d74 | [Feat]Add support for optional start len of logprobs (#1035); Co-authored-by: Ying Sheng <sqy1415@gmail.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>; Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>; Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> | 2024-08-18 23:45:41 -07:00 |
min-xu-et | fa13b95d6b | fixed a typo (#1143) | 2024-08-18 14:29:09 -07:00 |
Lianmin Zheng | 3c1f5a9220 | Fix duplicated imports in hf_transformers_utils.py (#1141) | 2024-08-17 18:03:00 -07:00 |
Lianmin Zheng | 57d0bd91ec | Improve benchmark (#1140) | 2024-08-17 17:43:23 -07:00 |
Lianmin Zheng | cdc8d60752 | Improve the code style: more comments and remove useless packages (#1139) | 2024-08-17 14:37:52 -07:00 |
Yineng Zhang | 9208591f05 | fix: use fp16 dtype for sm75 (#1136) | 2024-08-18 00:45:42 +10:00 |
Liangsheng Yin | f624f6a6cc | Fix port conflicts between local CI and runner CI. (#1131) | 2024-08-16 15:12:38 -07:00 |
Liangsheng Yin | 3694f8f996 | Mixed style of chunked prefill (#1013) | 2024-08-16 09:13:00 +00:00 |
Lianmin Zheng | 5a261bd055 | Fix the deadlock in multi-node tp (#1122) | 2024-08-16 01:39:24 -07:00 |
Yineng Zhang | 5bd953749b | chore: bump v0.2.13 (#1111) | 2024-08-16 03:50:43 +10:00 |
Lianmin Zheng | 0cb099e20a | set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) | 2024-08-16 03:47:39 +10:00 |
Ying Sheng | 93d4e354d8 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) | 2024-08-15 10:33:20 -07:00 |
Yineng Zhang | 9195d1362a | misc: rm unused model_loader (#1110) | 2024-08-15 08:29:35 -07:00 |
Ying Sheng | 14cb544d56 | [Fix] fix flashinfer usage for window attention (#1107) | 2024-08-15 00:53:24 -07:00 |
Lianmin Zheng | e86b1ccbf0 | Enable chunked prefill by default (#1040) | 2024-08-14 21:56:20 -07:00 |
Ying Sheng | 8d2d876fc8 | [Fix] fix the typo bug for window attention (#1106) | 2024-08-14 21:56:01 -07:00 |
Lianmin Zheng | 326df4bab2 | Use a single workspace for flashinfer (#1077) | 2024-08-14 19:25:37 -07:00 |
Ying Sheng | 6767e2229f | Support jinja as chat template file (#1104) | 2024-08-14 17:43:14 -07:00 |
Liangsheng Yin | 73cf6834f2 | Support stop_token_ids in sglang API (#1092) | 2024-08-15 00:31:39 +00:00 |
Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00 |
Liangsheng Yin | a34dd86a7d | Use dtype to control generate (#1082); Co-authored-by: zhyncs <me@zhyncs.com> | 2024-08-14 15:58:07 +00:00 |
Lianmin Zheng | a59636bb5e | Update grok 1 model (#1095) | 2024-08-14 04:40:44 -07:00 |
Lianmin Zheng | 8f790ac100 | Fix a bug in cuda graph runner (#1094) | 2024-08-14 03:25:38 -07:00 |
rainred | 616b59f384 | [Feature] modify Runtime to support skip_tokenizer_init (#1088); Co-authored-by: lzhang <zhanglei@modelbest.cn> | 2024-08-14 00:28:04 -07:00 |
Liangsheng Yin | e205527cb1 | Fix jump forward final state circular path bug. (#1084) | 2024-08-13 21:14:05 -07:00 |
Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00 |
Lianmin Zheng | ad3e4f1619 | Update the mixtral to use the better FusedMoE layer (#1081) | 2024-08-13 15:44:25 -07:00 |
rainred | 95f5fbf1a7 | Fix create_abort_task, GenerateReqInput does not have rids. (#1079); Co-authored-by: lzhang <zhanglei@modelbest.cn> | 2024-08-13 12:47:22 +00:00 |
Yineng Zhang | f7fb68d292 | ci: add moe test (#1053) | 2024-08-13 18:43:23 +10:00 |
Yineng Zhang | 65915f9f3e | fix: temporary solution for DeepSeek V2 H100 layout conversion issue (#1060); Co-authored-by: ispobock <ISPObaoke@163.com> | 2024-08-13 15:48:54 +10:00 |
Ke Bao | 162f3ccb01 | Fix layernorm input shape (#1066); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-08-13 15:48:07 +10:00 |
Yineng Zhang | 65e89baea9 | fix: not use the default port (#1068) | 2024-08-13 15:12:56 +10:00 |
Yineng Zhang | 6a38efa834 | feat: replace all rmsnorm and silu (#1057) | 2024-08-13 02:15:59 +10:00 |
Yineng Zhang | b0ad0c1bc8 | chore: bump v0.2.12 (#1048) | 2024-08-12 20:59:38 +10:00 |
Lianmin Zheng | c877292cc1 | Re-organize CI tests (#1052) | 2024-08-12 03:39:01 -07:00 |
Lianmin Zheng | 0c1c72a0b4 | Fix accuracy test (#1051) | 2024-08-12 19:48:40 +10:00 |
Lianmin Zheng | 41598e0d8e | Add longer accuracy test on CI (#1049) | 2024-08-12 09:21:38 +00:00 |
Ying Sheng | 32f6144323 | fix: Fix returned prefill logits and add output str test (#1046) | 2024-08-12 06:13:45 +00:00 |
Lianmin Zheng | fb1f28cbbb | Clean up the comments and names under python/sglang/srt/layers (#1047) | 2024-08-12 05:54:37 +00:00 |
Liangsheng Yin | fb7421db0d | minor: some potential bugs (#1044) | 2024-08-12 05:35:44 +00:00 |
Lianmin Zheng | 8207637029 | Improve end-to-end throughput test and its coverage (#1039) | 2024-08-11 18:27:33 -07:00 |
Liangsheng Yin | 7de6034534 | Fix the prefix indices (#1037) | 2024-08-11 17:57:02 -07:00 |
Lianmin Zheng | d84c5e70f7 | Test the case when max_new_tokens is very large (#1038) | 2024-08-11 16:41:03 -07:00 |
Lianmin Zheng | d785412077 | Fix the case when max_new_tokens is too large (#1025) | 2024-08-11 15:20:18 -07:00 |
Liangsheng Yin | 7b6a5332ca | Fix triton args init (#1034) | 2024-08-11 12:11:26 -07:00 |
Lianmin Zheng | 4080e82244 | Fix the case where r.prefix_indices is None (#1031) | 2024-08-11 04:53:51 -07:00 |
Yineng Zhang | c245b78973 | hotfix: add CustomOp abstraction (#1027) | 2024-08-11 02:45:59 -07:00 |
Lianmin Zheng | 9dae407812 | Improve type annotation (#1029) | 2024-08-11 02:44:59 -07:00 |
Liangsheng Yin | fcc0f5ed99 | Fix wrong assert (#1028) | 2024-08-11 09:22:16 +00:00 |