Commit Graph

  • 712216928f [Feature] Initial support for multi-LoRA serving (#1307) Ying Sheng 2024-09-12 16:46:14 -07:00
  • c33d82a211 Add Support for XVERSE Models (Dense and MoE) to sglang (#1397) hxer7963 2024-09-12 16:47:52 +08:00
  • 8234e663e9 [Minor Fix] Fix llava modalities issue for single-image (#1402) Kaichen Zhang - NTU 2024-09-12 16:10:26 +08:00
  • debbdb5178 kernel: use tensor cores for flashinfer gqa kernels (#1403) Zihao Ye 2024-09-12 00:38:18 -07:00
  • 3efa798116 Support cuda graph in the triton attention backend (#1401) Lianmin Zheng 2024-09-12 00:36:55 -07:00
  • 2a71be5e25 Fix README format (#1399) William 2024-09-12 14:46:51 +08:00
  • 4462137777 Add no commit to main rule (#1393) Liangsheng Yin 2024-09-11 14:40:45 -07:00
  • fec185ce0c Refactor attention backend (#1381) Lianmin Zheng 2024-09-11 11:44:26 -07:00
  • c03cece42f Improve error reporting during server launch (#1390) Lianmin Zheng 2024-09-11 04:50:04 -07:00
  • 15c75e4146 [Fix] Fix --disable-flashinfer (#1389) Lianmin Zheng 2024-09-11 04:36:21 -07:00
  • 224200e3c2 BaiChuan2 Model (#1367) Vectory 2024-09-11 18:55:24 +08:00
  • 8c0efa514d remove assertion in triton attention and add an unit test (#1385) Byron Hsu 2024-09-11 03:22:07 -07:00
  • 144bc70fcc Organize flashinfer indices update (#1378) Liangsheng Yin 2024-09-10 17:38:59 -07:00
  • 46094e0c1b Deprecate --disable-flashinfer and introduce --attention-backend (#1380) Lianmin Zheng 2024-09-10 17:11:16 -07:00
  • 3a6e8b6d78 [Minor] move triton attention kernels into a separate folder (#1379) Lianmin Zheng 2024-09-10 15:15:08 -07:00
  • fbb4754cb8 Fix vocab mask update bug (#1376) Liangsheng Yin 2024-09-10 13:10:36 -07:00
  • 6c7cb90365 [Minor] improve kill scripts and torchao import (#1375) Lianmin Zheng 2024-09-10 11:27:03 -07:00
  • dff2860a69 Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373) josephrocca 2024-09-11 00:35:03 +08:00
  • e72275cf7f Support MiniCPM3 (#1371) William 2024-09-10 17:57:52 +08:00
  • fec2d1223c [Fix] fix bug of undefined is_single in meth create_abort_task (#1370) wangchao 2024-09-10 16:17:37 +08:00
  • 8d1095dbf0 [Docs] Improve documentations (#1368) Lianmin Zheng 2024-09-09 20:48:28 -07:00
  • 743007e1ce Adding Documentation for installation (#1300) Chayenne 2024-09-10 10:09:13 +08:00
  • 9144ed1067 Support OpenAI API json_schema response format (#1363) zifeitong 2024-09-09 19:08:25 -07:00
  • 69b3bb9ae1 Unify forward mode (#1360) Liangsheng Yin 2024-09-09 13:49:29 -07:00
  • 689ff588ec [CI] Return output logprobs in unit test (#1361) Ying Sheng 2024-09-09 13:05:13 -07:00
  • a7c47e0f02 Add torchao quant (int4/int8/fp8) to llama models (#1341) Jerry Zhang 2024-09-09 05:32:41 -07:00
  • e4d68afcf0 [Minor] Many cleanup (#1357) Lianmin Zheng 2024-09-09 04:14:11 -07:00
  • c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298) Kai-Hsun Chen 2024-09-09 02:14:25 -07:00
  • 662ecd9368 [Feat] Add modalities for vision server when handling pixel values for llava (#1346) Kaichen Zhang - NTU 2024-09-09 17:07:34 +08:00
  • 8e6bdf851c [triton] Support head_dim not 2^n in triton extend and decode attention (#1281) Byron Hsu 2024-09-09 01:30:24 -07:00
  • 05bea6883c Fix some online scheduling delay (#1345) Liangsheng Yin 2024-09-07 20:46:27 -07:00
  • ab4a83b259 Optimize schedule (#1339) Liangsheng Yin 2024-09-05 14:30:26 -07:00
  • 62f15eea5a docs: add conclusion (#1340) Yineng Zhang 2024-09-06 04:25:14 +10:00
  • 79794af52d docs: highlight ttft itl and throughput (#1337) Yineng Zhang 2024-09-06 00:00:06 +10:00
  • 3494b32c3a docs: update README (#1336) Yineng Zhang 2024-09-05 23:39:44 +10:00
  • eda7c09048 Remove useless fields in global_config.py (#1328) Lianmin Zheng 2024-09-04 05:37:32 -07:00
  • 5ab9418f5b [Doc] update news (#1327) Yineng Zhang 2024-09-04 21:21:21 +10:00
  • 843e63d809 Fix the flaky test test_moe_eval_accuracy_large.py (#1326) Lianmin Zheng 2024-09-04 04:15:11 -07:00
  • a63c8275c6 chore: bump v0.3.0 (#1320) Yineng Zhang 2024-09-04 06:32:18 +10:00
  • dc67d97693 misc: speedup load safetensors (#1319) Yineng Zhang 2024-09-04 04:29:53 +10:00
  • 1e495e0847 [Fix] Fix select by ensuring each request has at least one token (#1318) Lianmin Zheng 2024-09-03 06:31:45 -07:00
  • 12cb115d38 Fix llama2 weight loader (#1317) Lianmin Zheng 2024-09-03 05:32:14 -07:00
  • c500f96bb1 Update README.md for llava-onevision instructions (#1313) Lianmin Zheng 2024-09-03 01:43:08 -07:00
  • 474317f2b6 Support Phi3 mini and medium (#1299) Jani Monoses 2024-09-03 07:49:40 +03:00
  • f64eae3a29 [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) Lianmin Zheng 2024-09-02 21:44:45 -07:00
  • a5a134f39f Fix bugs in sampler with CUDA graph / torch.compile (#1306) Liangsheng Yin 2024-09-02 16:18:48 -07:00
  • 2561ed012c feat: update nightly gsm8k eval (#1304) Yineng Zhang 2024-09-03 01:18:41 +10:00
  • 9999442756 Release v0.2.15 (#1295) Lianmin Zheng 2024-09-01 22:22:38 -07:00
  • 6def9b018c Fix hang when doing s += None. (#1297) Max Shawabkeh 2024-09-01 21:56:33 -07:00
  • 47f20da223 Fix regex mask (#1296) Liangsheng Yin 2024-09-01 21:50:58 -07:00
  • 4a9f8ea43b [doc] Fix more broken links (#1294) Byron Hsu 2024-09-01 14:46:36 -07:00
  • 58fa607622 Fix the flaky tests in test_moe_eval_accuracy_large.py (#1293) Lianmin Zheng 2024-09-01 12:20:46 -07:00
  • 6487ef64c6 ci: add nightly eval (#1291) Yineng Zhang 2024-09-02 03:19:49 +10:00
  • 9b0805242e fix: resolve fp8 for mixtral (#1290) Yineng Zhang 2024-09-02 00:29:06 +10:00
  • 32a4141d5a Allow new lines during JSON generation (#1277) Enrique Shockwave 2024-09-01 11:42:29 +01:00
  • 0836055324 [Chore] Rename model_overide_args to model_override_args (#1284) Kai-Hsun Chen 2024-09-01 03:14:56 -07:00
  • 00b19f198f [triton] Remove the zero initialization of qk_acc by directly writing the result (#1288) Byron Hsu 2024-09-01 03:12:06 -07:00
  • 6cb32ef92c Support Triton fp8 e5m2 kv cache (#1286) Ke Bao 2024-09-01 17:46:40 +08:00
  • 761b2cebd6 [CI] merge all ci tests into one file (#1289) Lianmin Zheng 2024-09-01 02:36:56 -07:00
  • 54772f784a feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 (#1285) Yineng Zhang 2024-09-01 17:28:06 +10:00
  • 1b5d56f7f8 [CI] Add more multi-gpu tests (#1280) Lianmin Zheng 2024-09-01 00:27:25 -07:00
  • d134c139a1 Optimize the update flashinfer indices (#1262) xiaobochen 2024-08-31 23:40:28 -07:00
  • 6cc9c52521 [doc] fix quick start link (#1282) Byron Hsu 2024-08-31 22:54:34 -07:00
  • 52cefdbf57 fix: resolve the fp8 bug introduced by vLLM 0.5.5 (#1276) Yineng Zhang 2024-09-01 00:44:29 +10:00
  • 51c554d812 Allow more flexible assistant and system response (#1256) Christopher Chou 2024-08-30 11:51:44 -07:00
  • 79ece2c51f Report median instead of mean in bench_latency.py (#1269) Lianmin Zheng 2024-08-30 06:05:01 -07:00
  • 55f5976b42 Update README.md - Supported Models add Exaone 3.0 (#1267) 김종곤 2024-08-30 17:49:07 +09:00
  • b7f8341014 EXAONE 3.0 Model Support (#1258) 김종곤 2024-08-30 17:08:28 +09:00
  • f414352ae6 Transpose mla weight offline (#1261) Ke Bao 2024-08-30 14:45:40 +08:00
  • a362340b33 fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader (#1260) lxww302 2024-08-29 23:43:41 -07:00
  • 381dd57bd6 Sampler cudagraph (#1253) Liangsheng Yin 2024-08-28 18:58:52 -07:00
  • 8153168c96 fix data racing due to mutable reference using deepcopy (#1255) Zhiqiang Xie 2024-08-28 18:57:54 -07:00
  • 6c34d6339c make json_schema usable from gen (#1254) Enrique Shockwave 2024-08-29 02:57:10 +01:00
  • 13ac95b894 chore: bump v0.2.14.post2 (#1250) Yineng Zhang 2024-08-29 04:46:33 +10:00
  • 492143bf32 fix: resolve qwen2 moe weight loader (#1252) Yineng Zhang 2024-08-29 04:25:46 +10:00
  • 0a97d7962d [Fix] Fix OOM in llava base class (#1249) Lianmin Zheng 2024-08-28 08:38:50 -07:00
  • c411f32e1c feat: replace GeluAndMul (#1234) Yineng Zhang 2024-08-29 00:07:02 +10:00
  • bf53bf5142 [Fix] Fix llava on multi images (#1247) Lianmin Zheng 2024-08-28 06:33:05 -07:00
  • b1a540ec42 feat: update GemmaRMSNorm (#1232) Yineng Zhang 2024-08-28 22:47:34 +10:00
  • 66975360e7 fix: increase max_new_tokens when testing generation models (#1244) Yineng Zhang 2024-08-28 22:12:36 +10:00
  • 6c49831394 Add sglang.bench_latency to CI (#1243) Lianmin Zheng 2024-08-28 04:20:54 -07:00
  • f25f4dfde5 hotfix: revert sampler CUDA Graph (#1242) Yineng Zhang 2024-08-28 21:16:47 +10:00
  • 184ae1c683 Update README.md (#1239) Lianmin Zheng 2024-08-28 02:15:52 -07:00
  • 198974cd1a feat: support sm75 with FlashInfer v0.1.6 (#1233) Yineng Zhang 2024-08-28 18:39:12 +10:00
  • 6cc38b2bf3 [Minor] Add more type annotations (#1237) Lianmin Zheng 2024-08-28 00:54:26 -07:00
  • 1ece2cda3d Fix bench latency benchmark (#1225) Liangsheng Yin 2024-08-28 00:37:32 -07:00
  • c8a9e79186 Fix readme (#1236) Dr. Artificial曾小健 2024-08-28 14:51:41 +08:00
  • 3602692c7c feat: replace get_act_fn for gpt_bigcode (#1231) Yineng Zhang 2024-08-27 21:15:31 +10:00
  • 909f34363b [FIX] Wrong logger (#1230) havetc 2024-08-27 12:10:46 +02:00
  • 5ff25cdf5b [Minor] add delete test and delete tmp file on ci server (#1227) yichuan~ 2024-08-26 22:04:52 -07:00
  • 2f1d92834f [FEAT] Support batches cancel (#1222) caiyueliang 2024-08-27 07:28:26 +08:00
  • c61a1b6f97 Torch compile CI throughput test (#1223) Liangsheng Yin 2024-08-26 13:52:58 -07:00
  • 9935f97b3e [FEAT] JSON constrained support (#1125) havetc 2024-08-26 18:37:26 +02:00
  • c5fe11a8e1 chore: bump v0.2.14 (#1155) Yineng Zhang 2024-08-27 00:28:24 +10:00
  • 75ce37f401 Move sampler into CUDA graph (#1201) Liangsheng Yin 2024-08-26 07:02:50 -07:00
  • 97589a60a2 [CI] Parallelize unit tests in CI (#1219) Mingyi 2024-08-25 21:54:02 -07:00
  • 632d506d0b minor: improve CI and dependencies (#1212) Liangsheng Yin 2024-08-25 21:26:31 -07:00
  • 3579162ab1 [Fix] Multi-images loading error (#1218) Kaichen Zhang - NTU 2024-08-26 11:58:51 +08:00
  • 7514b9f8d3 [CI] Fix CI (#1217) Mingyi 2024-08-25 19:56:42 -07:00
  • 158e8f1e2d improve the threshold and ports in tests (#1215) Mingyi 2024-08-25 19:02:08 -07:00