Ying Sheng | 98111fbe3e | Revert "Chunked prefill support" (#799) | 2024-07-29 02:38:31 -07:00 |
Liangsheng Yin | 2ec39ab712 | Chunked prefill support (#797) | 2024-07-29 02:21:50 -07:00 |
Ying Sheng | 325a06c2de | Fix logging (#796) | 2024-07-28 23:01:45 -07:00 |
Yineng Zhang | dd7e8b9421 | chore: add copyright for srt (#790) | 2024-07-28 23:07:12 +10:00 |
Lianmin Zheng | 30db99b3d9 | Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs (#776) | 2024-07-27 19:50:34 -07:00 |
Lianmin Zheng | a036d41980 | Fix max new tokens (#772) | 2024-07-27 17:22:18 -07:00 |
Lianmin Zheng | f95e661757 | Fix max_tokens for OpenAI chat completion API (#766) | 2024-07-27 15:44:27 -07:00 |
Ying Sheng | 30d8e130e7 | Improve benchmark scripts (#717) | 2024-07-24 14:44:14 -07:00 |
Ying Sheng | 83d2b30d75 | format | 2024-07-24 10:53:07 +00:00 |
Ying Sheng | 4367f4bb8d | Fix prefill size (#711) | 2024-07-24 03:41:15 -07:00 |
Liangsheng Yin | 4cd64b8ee6 | Auto adjust new ratio (#708) | 2024-07-23 22:06:02 -07:00 |
Liangsheng Yin | 268684439b | Use min new token ratio at start (#701) | 2024-07-23 11:52:50 -07:00 |
Ying Sheng | 06487f126e | refactor model loader: initial refactor (#664) | 2024-07-20 02:18:22 -07:00 |
Liangsheng Yin | 7620cd37dd | Fix jump forward when streaming (#665) | 2024-07-19 16:42:06 -07:00 |
Liangsheng Yin | a9ef49c12c | Detokenize incrementally when streaming (#653) | 2024-07-18 17:57:40 -07:00 |
Mingyi | d774acad5c | Remove the dependency of rpyc (#646) | 2024-07-18 02:13:54 -07:00 |
Liangsheng Yin | 3de2f30a27 | Flashinfer sample kernel (#617) | 2024-07-17 13:24:43 -07:00 |
zhyncs | 2e341cd493 | misc: add pre-commit config (#637) | 2024-07-17 11:55:39 -07:00 |
Ying Sheng | a470e60c97 | clean up step function (#635) | 2024-07-16 20:15:24 -07:00 |
Ying Sheng | 6a2941f4d0 | Improve tensor parallel performance (#625); Co-authored-by: Mingyi <wisclmy0611@gmail.com> | 2024-07-15 07:10:51 -07:00 |
Mingyi | 5ac8b80677 | Simplify mem state (#623) | 2024-07-15 02:01:09 -07:00 |
Liangsheng Yin | 564a898ad9 | Optimize mem indices mangement (#619) | 2024-07-13 23:39:37 -07:00 |
Ying Sheng | 5949b1ca0e | Fix memory pool index error (#616) | 2024-07-13 16:45:11 -07:00 |
Lianmin Zheng | 665815969a | Enable cuda graph by default (#612) | 2024-07-13 05:29:46 -07:00 |
Ying Sheng | dc1b8bcfaa | Format (#593) | 2024-07-05 10:06:17 -07:00 |
Ying Sheng | 75b31a2a88 | Update run_batch interface and max_prefill_tokens (#574) | 2024-06-30 18:26:04 -07:00 |
sglang | 11616fc6bd | Minor fix in compiler & format (#545) | 2024-06-29 23:42:14 -07:00 |
Lianmin Zheng | badf3fa020 | Expose dtype argument (#569) | 2024-06-27 23:30:39 -07:00 |
Lianmin Zheng | 2e6e62e156 | Increase the number of thread limitation for tp worker managers. (#567) | 2024-06-26 09:33:45 -07:00 |
Lianmin Zheng | a385ee27bd | Warmup cublas (#566) | 2024-06-25 12:46:00 -07:00 |
Lianmin Zheng | 303ef8883e | Clean up logits processor (#558) | 2024-06-22 00:25:24 -07:00 |
Ying Sheng | 09593e9bc9 | Multi-node Tensor Parallelism (#550); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-06-17 20:41:24 -07:00 |
Qubitium-modelcloud | bbec01c9aa | Fix tp worker only checking req[0] for stream (#546) | 2024-06-14 22:56:10 -07:00 |
Ying Sheng | fb9296f0ed | Higher priority for user input of max_prefill_tokens & format (#540) | 2024-06-12 21:48:40 -07:00 |
Liangsheng Yin | 9c902b1954 | Decode Incrementally (#517) | 2024-06-11 23:39:12 -07:00 |
Lianmin Zheng | f6dbd24043 | Improve doc strings (#518) | 2024-06-08 02:39:32 -07:00 |
Lianmin Zheng | 91f93f141f | Crash the server when error or OOM happens (#514) | 2024-06-07 19:22:34 -07:00 |
Qubitium | f70f72586a | Fix rid state map leak + Refractor .finished (#505); Co-authored-by: ZX <zx@lbx.dev> | 2024-06-07 13:20:40 -07:00 |
Lianmin Zheng | c0ae70c8ed | Improve logging & fix litellm dependency. (#512) | 2024-06-07 13:10:32 -07:00 |
Ying Sheng | 83525a1df2 | Revert "Make the server random by default" (#492) | 2024-05-31 12:00:21 -07:00 |
Lianmin Zheng | 80a33ce8b0 | Do not set the default value of global random seed (#488) | 2024-05-29 18:41:18 -04:00 |
Ying Sheng | 0463f7fb52 | Support data parallelism (static) (#480); Co-authored-by: Ying Sheng <ying.sheng@databricks.com>; Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>; Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>; Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> | 2024-05-27 21:24:10 -07:00 |