42 Commits

Author SHA1 Message Date
Ying Sheng
98111fbe3e Revert "Chunked prefill support" (#799) 2024-07-29 02:38:31 -07:00
Liangsheng Yin
2ec39ab712 Chunked prefill support (#797) 2024-07-29 02:21:50 -07:00
Ying Sheng
325a06c2de Fix logging (#796) 2024-07-28 23:01:45 -07:00
Yineng Zhang
dd7e8b9421 chore: add copyright for srt (#790) 2024-07-28 23:07:12 +10:00
Lianmin Zheng
30db99b3d9 Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs (#776) 2024-07-27 19:50:34 -07:00
Lianmin Zheng
a036d41980 Fix max new tokens (#772) 2024-07-27 17:22:18 -07:00
Lianmin Zheng
f95e661757 Fix max_tokens for OpenAI chat completion API (#766) 2024-07-27 15:44:27 -07:00
Ying Sheng
30d8e130e7 Improve benchmark scripts (#717) 2024-07-24 14:44:14 -07:00
Ying Sheng
83d2b30d75 format 2024-07-24 10:53:07 +00:00
Ying Sheng
4367f4bb8d Fix prefill size (#711) 2024-07-24 03:41:15 -07:00
Liangsheng Yin
4cd64b8ee6 Auto adjust new ratio (#708) 2024-07-23 22:06:02 -07:00
Liangsheng Yin
268684439b Use min new token ratio at start (#701) 2024-07-23 11:52:50 -07:00
Ying Sheng
06487f126e refactor model loader: initial refactor (#664) 2024-07-20 02:18:22 -07:00
Liangsheng Yin
7620cd37dd Fix jump forward when streaming (#665) 2024-07-19 16:42:06 -07:00
Liangsheng Yin
a9ef49c12c Detokenize incrementally when streaming (#653) 2024-07-18 17:57:40 -07:00
Mingyi
d774acad5c Remove the dependency of rpyc (#646) 2024-07-18 02:13:54 -07:00
Liangsheng Yin
3de2f30a27 Flashinfer sample kernel (#617) 2024-07-17 13:24:43 -07:00
zhyncs
2e341cd493 misc: add pre-commit config (#637) 2024-07-17 11:55:39 -07:00
Ying Sheng
a470e60c97 clean up step function (#635) 2024-07-16 20:15:24 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625) 2024-07-15 07:10:51 -07:00
    Co-authored-by: Mingyi <wisclmy0611@gmail.com>
Mingyi
5ac8b80677 Simplify mem state (#623) 2024-07-15 02:01:09 -07:00
Liangsheng Yin
564a898ad9 Optimize mem indices mangement (#619) 2024-07-13 23:39:37 -07:00
Ying Sheng
5949b1ca0e Fix memory pool index error (#616) 2024-07-13 16:45:11 -07:00
Lianmin Zheng
665815969a Enable cuda graph by default (#612) 2024-07-13 05:29:46 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Ying Sheng
75b31a2a88 Update run_batch interface and max_prefill_tokens (#574) 2024-06-30 18:26:04 -07:00
sglang
11616fc6bd Minor fix in compiler & format (#545) 2024-06-29 23:42:14 -07:00
Lianmin Zheng
badf3fa020 Expose dtype argument (#569) 2024-06-27 23:30:39 -07:00
Lianmin Zheng
2e6e62e156 Increase the number of thread limitation for tp worker managers. (#567) 2024-06-26 09:33:45 -07:00
Lianmin Zheng
a385ee27bd Warmup cublas (#566) 2024-06-25 12:46:00 -07:00
Lianmin Zheng
303ef8883e Clean up logits processor (#558) 2024-06-22 00:25:24 -07:00
Ying Sheng
09593e9bc9 Multi-node Tensor Parallelism (#550) 2024-06-17 20:41:24 -07:00
    Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Qubitium-modelcloud
bbec01c9aa Fix tp worker only checking req[0] for stream (#546) 2024-06-14 22:56:10 -07:00
Ying Sheng
fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) 2024-06-12 21:48:40 -07:00
Liangsheng Yin
9c902b1954 Decode Incrementally (#517) 2024-06-11 23:39:12 -07:00
Lianmin Zheng
f6dbd24043 Improve doc strings (#518) 2024-06-08 02:39:32 -07:00
Lianmin Zheng
91f93f141f Crash the server when error or OOM happens (#514) 2024-06-07 19:22:34 -07:00
Qubitium
f70f72586a Fix rid state map leak + Refractor .finished (#505) 2024-06-07 13:20:40 -07:00
    Co-authored-by: ZX <zx@lbx.dev>
Lianmin Zheng
c0ae70c8ed Improve logging & fix litellm dependency. (#512) 2024-06-07 13:10:32 -07:00
Ying Sheng
83525a1df2 Revert "Make the server random by default" (#492) 2024-05-31 12:00:21 -07:00
Lianmin Zheng
80a33ce8b0 Do not set the default value of global random seed (#488) 2024-05-29 18:41:18 -04:00
Ying Sheng
0463f7fb52 Support data parallelism (static) (#480) 2024-05-27 21:24:10 -07:00
    Co-authored-by: Ying Sheng <ying.sheng@databricks.com>
    Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
    Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
    Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>