Commit Graph

  • d3efcb3930 Update workflow files (#1214) Lianmin Zheng 2024-08-25 17:45:35 -07:00
  • 2c615d120f [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) Ke Bao 2024-08-26 08:38:11 +08:00
  • 61bb223e0f Update CI runner docs (#1213) Lianmin Zheng 2024-08-25 17:31:52 -07:00
  • 15f1a49d2d Update CI workflows (#1210) Lianmin Zheng 2024-08-25 16:43:07 -07:00
  • 308d024092 [CI] Fix the issue of unit test hanging (#1211) Ying Sheng 2024-08-25 16:21:37 -07:00
  • ab4990e4bf [Minor] Temporarily skip flaky test (#1209) Ying Sheng 2024-08-25 14:49:23 -07:00
  • 902278008a [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) Lianmin Zheng 2024-08-25 14:46:34 -07:00
  • 30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) Chayenne 2024-08-26 01:29:12 +08:00
  • 66e7dcaf70 [Fix] Fixing the multi-images error for llava-onevision (#1205) Kaichen Zhang - NTU 2024-08-26 01:28:23 +08:00
  • bc4c7a3545 Relax the assert in moe throughput test to fix the flaky CI (#1207) Lianmin Zheng 2024-08-25 10:27:02 -07:00
  • 1cb4da5c5f [Fix] the issue of random order when input is a list (#1199) Ying Sheng 2024-08-24 21:43:03 -07:00
  • e61d13acdf [CI] Fix the problem of hf runner too slow (#1202) Ying Sheng 2024-08-24 18:35:55 -07:00
  • b20daf982a Update README.md (#1198) Lianmin Zheng 2024-08-24 14:50:05 -07:00
  • f6af3a6561 Cleanup readme, llava examples, usage examples and nccl init (#1194) Lianmin Zheng 2024-08-24 08:02:23 -07:00
  • c9064e6fd9 feat: use gelu_tanh_and_mul (#1193) Yineng Zhang 2024-08-24 18:58:16 +10:00
  • a5b14ad043 [Feat/WIP] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. (#1123) Kaichen Zhang - NTU 2024-08-24 05:11:16 +08:00
  • 5fafcac008 Fix benchmark script (#1185) Ying Sheng 2024-08-22 02:03:25 -07:00
  • 364d3d72a7 Fix broken penalty (#1184) Liangsheng Yin 2024-08-22 01:16:35 -07:00
  • 5623826f73 [Minor] Improve logging and rename the health check endpoint name (#1180) Lianmin Zheng 2024-08-21 19:24:36 -07:00
  • 83e23c69b3 Improve code style of sampler (#1168) Liangsheng Yin 2024-08-21 16:48:24 -07:00
  • ac1b74fa85 [Docs] Fix rendering of details in README (#1179) Zhanghao Wu 2024-08-21 16:05:33 -07:00
  • 068e9eae55 Support min-p sampling (#1167) intervitens 2024-08-22 01:49:32 +03:00
  • d6aeb9fa15 [Feature] Add a function to convert sampling_params to kwargs (#1170) rainred 2024-08-22 05:28:35 +08:00
  • 1fb9459908 fix: custom op fallback forward native when lower sm80 (#1177) Yineng Zhang 2024-08-22 07:26:35 +10:00
  • bea2bb9eea Improve multi-node stability (#1171) Lianmin Zheng 2024-08-20 22:35:05 -07:00
  • cd10654e7e [Feat] Support update weights without restart server (#1157) Shan Yu 2024-08-20 13:48:24 -07:00
  • 350a81609b fix: resolve README render (#1166) Yineng Zhang 2024-08-21 03:23:52 +10:00
  • 6242c399ab Generate 1 token to verify the health of the inference service in /health (#1154) Lucien 2024-08-21 01:14:34 +08:00
  • 04707b09b7 misc: add hypervisor vendor (#1165) Yineng Zhang 2024-08-21 02:14:51 +10:00
  • ff2cfdb1a2 [Feature] add disable-custom-all-reduce (#1148) Xu-Chen 2024-08-20 23:44:12 +08:00
  • a8ae640328 Improve docs and warnings (#1164) Lianmin Zheng 2024-08-20 08:31:29 -07:00
  • d8476818ef feat: allow streaming for multi-prompt and/or parallel sampling (#1134) Juwan Yoo 2024-08-20 08:06:55 -07:00
  • df191254ab Optimize MLA/GQA/MQA Triton decoding (#1138) Ke Bao 2024-08-19 18:23:07 +08:00
  • b997a18d74 [Feat]Add support for optional start len of logprobs (#1035) yichuan~ 2024-08-18 23:45:41 -07:00
  • d8627ed16d [Docs] Add instruction for running on clouds and kubernetes with SkyPilot (#1144) Zhanghao Wu 2024-08-18 23:01:55 -07:00
  • fa13b95d6b fixed a typo (#1143) min-xu-et 2024-08-18 14:29:09 -07:00
  • 3c1f5a9220 Fix duplicated imports in hf_transformers_utils.py (#1141) Lianmin Zheng 2024-08-17 18:03:00 -07:00
  • 57d0bd91ec Improve benchmark (#1140) Lianmin Zheng 2024-08-17 17:43:23 -07:00
  • cdc8d60752 Improve the code style: more comments and remove useless packages (#1139) Lianmin Zheng 2024-08-17 14:37:52 -07:00
  • 9208591f05 fix: use fp16 dtype for sm75 (#1136) Yineng Zhang 2024-08-17 22:45:42 +08:00
  • 5d0d40d0eb Fix CI accuracy && time out limit (#1133) Liangsheng Yin 2024-08-16 21:41:11 -07:00
  • f624f6a6cc Fix port conflicts between local CI and runner CI. (#1131) Liangsheng Yin 2024-08-16 15:12:38 -07:00
  • 3694f8f996 Mixed style of chunked prefill (#1013) Liangsheng Yin 2024-08-16 02:13:00 -07:00
  • 5a261bd055 Fix the deadlock in multi-node tp (#1122) Lianmin Zheng 2024-08-16 01:39:24 -07:00
  • 6aa8ad14f8 fix: resolve Python.h header missing (#1119) Yineng Zhang 2024-08-16 13:46:43 +08:00
  • 26e9c12c15 ci: compatible with fork repo (#1115) Yineng Zhang 2024-08-16 02:26:44 +08:00
  • 87a0db82b8 update hyperparameter guide (#1114) Lianmin Zheng 2024-08-15 10:54:24 -07:00
  • 5bd953749b chore: bump v0.2.13 (#1111) Yineng Zhang 2024-08-16 01:50:43 +08:00
  • 0cb099e20a set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) Lianmin Zheng 2024-08-15 10:47:39 -07:00
  • 93d4e354d8 [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) Ying Sheng 2024-08-15 10:33:20 -07:00
  • 9195d1362a misc: rm unused model_loader (#1110) Yineng Zhang 2024-08-15 23:29:35 +08:00
  • 14cb544d56 [Fix] fix flashinfer usage for window attention (#1107) Ying Sheng 2024-08-15 00:53:24 -07:00
  • e86b1ccbf0 Enable chunked prefill by default (#1040) Lianmin Zheng 2024-08-14 21:56:20 -07:00
  • 8d2d876fc8 [Fix] fix the typo bug for window attention (#1106) Ying Sheng 2024-08-14 21:56:01 -07:00
  • 326df4bab2 Use a single workspace for flashinfer (#1077) Lianmin Zheng 2024-08-14 19:25:37 -07:00
  • 6767e2229f Support jinja as chat template file (#1104) Ying Sheng 2024-08-14 17:43:14 -07:00
  • 73cf6834f2 Support stop_token_ids in sglang API (#1092) Liangsheng Yin 2024-08-14 17:31:39 -07:00
  • 1c2b5f5240 docs: update nsys usage (#1103) Yineng Zhang 2024-08-15 01:39:15 +08:00
  • 96a2093ef0 [Fix] Compatibility of window attention and cuda graph (#1090) Ying Sheng 2024-08-14 10:37:01 -07:00
  • a34dd86a7d Use dtype to control generate (#1082) Liangsheng Yin 2024-08-14 08:58:07 -07:00
  • 67c0d832a6 docs: update pr template (#1099) Yineng Zhang 2024-08-14 20:25:39 +08:00
  • a59636bb5e Update grok 1 model (#1095) Lianmin Zheng 2024-08-14 04:40:44 -07:00
  • fe5024325b docs: update README (#1098) Yineng Zhang 2024-08-14 19:40:05 +08:00
  • f14569f64a ci: remove workflow path trigger (#1096) Yineng Zhang 2024-08-14 18:36:24 +08:00
  • 8f790ac100 Fix a bug in cuda graph runner (#1094) Lianmin Zheng 2024-08-14 03:25:38 -07:00
  • 616b59f384 [Feature] modify Runtime to support skip_tokenizer_init (#1088) rainred 2024-08-14 15:28:04 +08:00
  • c8423ca311 ci: update timeout and retry (#1086) Yineng Zhang 2024-08-14 15:27:35 +08:00
  • e205527cb1 Fix jump forward final state circular path bug. (#1084) Liangsheng Yin 2024-08-13 21:14:05 -07:00
  • 0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) Ying Sheng 2024-08-13 17:01:26 -07:00
  • ad3e4f1619 Update the mixtral to use the better FusedMoE layer (#1081) Lianmin Zheng 2024-08-13 15:44:25 -07:00
  • 312e849255 Example file for docker compose and k8s (#1006) Lucien 2024-08-14 06:07:57 +08:00
  • 95f5fbf1a7 Fix create_abort_task, GenerateReqInput does not have rids. (#1079) rainred 2024-08-13 20:47:22 +08:00
  • cebd78d83e ci: add accuracy timeout (#1078) Yineng Zhang 2024-08-13 20:12:58 +08:00
  • 0076f11541 fix: use devel for Triton's compiler requirements (#1074) Yineng Zhang 2024-08-13 19:08:43 +08:00
  • f7fb68d292 ci: add moe test (#1053) Yineng Zhang 2024-08-13 16:43:23 +08:00
  • 396a13e6ad ci: add cancel pr workflow (#1070) Yineng Zhang 2024-08-13 16:16:50 +08:00
  • 65915f9f3e fix: temporary solution for DeepSeek V2 H100 layout conversion issue (#1060) Yineng Zhang 2024-08-13 13:48:54 +08:00
  • 162f3ccb01 Fix layernorm input shape (#1066) Ke Bao 2024-08-13 13:48:07 +08:00
  • 65e89baea9 fix: not use the default port (#1068) Yineng Zhang 2024-08-13 13:12:56 +08:00
  • 6a38efa834 feat: replace all rmsnorm and silu (#1057) Yineng Zhang 2024-08-13 00:15:59 +08:00
  • b0ad0c1bc8 chore: bump v0.2.12 (#1048) Yineng Zhang 2024-08-12 18:59:38 +08:00
  • c877292cc1 Re-organize CI tests (#1052) Lianmin Zheng 2024-08-12 03:39:01 -07:00
  • 0c1c72a0b4 Fix accuracy test (#1051) Lianmin Zheng 2024-08-12 02:48:40 -07:00
  • 41598e0d8e Add longer accuracy test on CI (#1049) Lianmin Zheng 2024-08-12 02:21:38 -07:00
  • 89f23a5178 docs: update setup github runner (#1050) Yineng Zhang 2024-08-12 16:11:38 +08:00
  • cb99ba4fc6 feat: update Dockerfile (#1033) Yineng Zhang 2024-08-12 14:24:06 +08:00
  • 32f6144323 fix: Fix returned prefill logits and add output str test (#1046) Ying Sheng 2024-08-11 23:13:45 -07:00
  • fb1f28cbbb Clean up the comments and names under python/sglang/srt/layers (#1047) Lianmin Zheng 2024-08-11 22:54:37 -07:00
  • fb7421db0d minor: some potential bugs (#1044) Liangsheng Yin 2024-08-11 22:35:44 -07:00
  • 14b6493087 Delete the useless test/srt/test_throughput.py (#1045) Lianmin Zheng 2024-08-11 21:31:52 -07:00
  • 8207637029 Improve end-to-end throughput test and its coverage (#1039) Lianmin Zheng 2024-08-11 18:27:33 -07:00
  • 7de6034534 Fix the prefix indices (#1037) Liangsheng Yin 2024-08-11 17:57:02 -07:00
  • d84c5e70f7 Test the case when max_new_tokens is very large (#1038) Lianmin Zheng 2024-08-11 16:41:03 -07:00
  • d785412077 Fix the case when max_new_tokens is too large (#1025) Lianmin Zheng 2024-08-11 15:20:18 -07:00
  • 7b6a5332ca Fix triton args init (#1034) Liangsheng Yin 2024-08-11 12:11:26 -07:00
  • 4080e82244 Fix the case where r.prefix_indices is None (#1031) Lianmin Zheng 2024-08-11 04:53:51 -07:00
  • c245b78973 hotfix: add CustomOp abstraction (#1027) Yineng Zhang 2024-08-11 17:45:59 +08:00
  • 9dae407812 Improve type annotation (#1029) Lianmin Zheng 2024-08-11 02:44:59 -07:00
  • fcc0f5ed99 Fix wrong assert (#1028) Liangsheng Yin 2024-08-11 02:22:16 -07:00
  • a97df79124 Clean up readme and arguments of chunked prefill (#1022) Lianmin Zheng 2024-08-11 01:18:52 -07:00