234 Commits

Author SHA1 Message Date
Lianmin Zheng
eb1ae6ae0c Add sglang.bench_latency for offline benchmark (#564) 2024-06-25 03:38:04 -07:00
Lianmin Zheng
2187f36237 Add a new argument log_level_http to control the HTTP logging (#563) 2024-06-25 01:16:20 -07:00
Lianmin Zheng
9465b668b9 Allow running with vllm==0.4.3 (#561) 2024-06-24 15:24:21 -07:00
Lianmin Zheng
1fa15099d8 Add LlamaForClassification (#559) 2024-06-22 00:49:31 -07:00
Lianmin Zheng
303ef8883e Clean up logits processor (#558) 2024-06-22 00:25:24 -07:00
Lianmin Zheng
e94e60d6fb make flashinfer workspace larger 2024-06-21 17:32:36 -07:00
Lianmin Zheng
d2f8bfb2e1 Follow-up fixes for flashinfer 0.0.5 (#556) 2024-06-20 23:19:52 -07:00
Lianmin Zheng
b7e2f800ac Update flashinfer to 0.0.5 (#554) 2024-06-20 20:29:06 -07:00
Ying Sheng
09593e9bc9 Multi-node Tensor Parallelism (#550)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-06-17 20:41:24 -07:00
Lianmin Zheng
53a7ebd89a Update fused_moe (#553) 2024-06-17 09:47:58 -07:00
Liangsheng Yin
ad5f04d6ce Fix the Jump-Forward with Chinese (#551) 2024-06-16 21:45:04 +08:00
Qubitium-modelcloud
bbec01c9aa Fix tp worker only checking req[0] for stream (#546) 2024-06-14 22:56:10 -07:00
Ying Sheng
fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) 2024-06-12 21:48:40 -07:00
Ying Sheng
1374334d38 Fix dependency & crash issues (#539) 2024-06-12 21:23:19 -07:00
Lianmin Zheng
94aead9e8d Fix dependency (#538) 2024-06-12 13:17:35 -07:00
Liangsheng Yin
9c902b1954 Decode Incrementally (#517) 2024-06-11 23:39:12 -07:00
ZhouXingg
111991fe23 Fix Regression: Disable p2p for 4090 (#531)
Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
2024-06-11 23:27:17 -07:00
Qubitium
a8c787d2b3 Add ChatGLM Model Support (#516)
Co-authored-by: ZX <zx@lbx.dev>
2024-06-11 16:39:52 -07:00
Fabian Preiß
5f283991e9 [Minor] Correct Optional type hints in api (#526) 2024-06-11 16:37:27 -07:00
Fabian Preiß
542bc733d6 Fix missing numpy dependency in pyproject.toml (#524) 2024-06-10 12:13:50 -07:00
Lianmin Zheng
f6dbd24043 Improve doc strings (#518) 2024-06-08 02:39:32 -07:00
Lianmin Zheng
e8a2327d52 Update version to 0.1.17 (#515) 2024-06-07 19:49:18 -07:00
Lianmin Zheng
91f93f141f Crash the server when error or OOM happens (#514) 2024-06-07 19:22:34 -07:00
Qubitium
f70f72586a Fix rid state map leak + Refactor .finished (#505)
Co-authored-by: ZX <zx@lbx.dev>
2024-06-07 13:20:40 -07:00
Lianmin Zheng
c0ae70c8ed Improve logging & fix litellm dependency. (#512) 2024-06-07 13:10:32 -07:00
胡译文
87260b7bfd Litellm Backend (#502) 2024-06-07 12:24:28 -07:00
Amos You
651a23ee7c remove redundant pad_input_ids function (#500) 2024-06-07 12:23:29 -07:00
Lianmin Zheng
bf3e271fe0 Update vllm to v0.4.3 (#511)
Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-06-07 12:11:31 -07:00
Lianmin Zheng
3bc01ac137 [Minor] improve code style 2024-06-03 18:11:34 -07:00
Lianmin Zheng
159cc741e4 Make the server random by default (#493) 2024-05-31 23:33:34 -07:00
Ying Sheng
83525a1df2 Revert "Make the server random by default" (#492) 2024-05-31 12:00:21 -07:00
Lianmin Zheng
80a33ce8b0 Do not set the default value of global random seed (#488) 2024-05-29 18:41:18 -04:00
Lianmin Zheng
1a57e41679 do not launch workers in parallel 2024-05-27 23:00:16 -07:00
Ying Sheng
0463f7fb52 Support data parallelism (static) (#480)
Co-authored-by: Ying Sheng <ying.sheng@databricks.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2024-05-27 21:24:10 -07:00
Lianmin Zheng
565d727409 improve logging & fix vllm version 2024-05-27 15:04:23 -07:00
Lianmin Zheng
09de730dee Improve benchmark scripts & add more models (#484) 2024-05-27 14:13:26 -07:00
Lianmin Zheng
55c1643627 Improve benchmark scripts & rename some scripts (#477) 2024-05-26 12:51:45 -07:00
Li Bo
2b605ab1d7 [Feat/Fix] Refactoring Llava models into single file (#475) 2024-05-26 12:29:51 -07:00
Liangsheng Yin
f06e90c2cf Optimize retract (#440) 2024-05-26 00:07:26 +08:00
Lianmin Zheng
2cea6146d8 Improve logging & add logit cap (#471) 2024-05-24 03:48:53 -07:00
Lianmin Zheng
0fafc5606b port fp8 mixtral (#460) 2024-05-21 11:46:35 -07:00
Lianmin Zheng
19d2135cb8 Use model loader from vllm (#459) 2024-05-21 09:13:37 -07:00
Lianmin Zheng
ced77c6626 Rename api_num_spec_tokens -> num_api_spec_tokens (#458) 2024-05-20 18:44:23 -07:00
Lianmin Zheng
8dbdc018a3 Abort disconnected requests (#457) 2024-05-20 18:41:21 -07:00
Ying Sheng
3e684be7a3 Fix openai speculative execution (#456) 2024-05-20 17:01:13 -07:00
LiviaSun
ec380dfd30 openai chat speculative execution (#250)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-05-18 22:23:53 -07:00
Liangsheng Yin
5b647543c1 Fix the broken --disable-radix-cache (#451) 2024-05-19 13:00:12 +08:00
Lianmin Zheng
8210ec60f4 Improve error handling & abort disconnected requests (#449) 2024-05-17 05:49:31 -07:00
Ying Sheng
5be9eb8a8c Add PUT for generate api (#448) 2024-05-17 02:35:15 -07:00
Lianmin Zheng
c05956e534 Simplify port allocation (#447) 2024-05-16 18:07:30 -07:00