85 Commits

Author SHA1 Message Date
Ke Bao
2c615d120f [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-25 17:38:11 -07:00
Lianmin Zheng
902278008a [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) 2024-08-25 14:46:34 -07:00
Chayenne
30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-25 10:29:12 -07:00
Lianmin Zheng
f6af3a6561 Cleanup readme, llava examples, usage examples and nccl init (#1194) 2024-08-24 08:02:23 -07:00
Lianmin Zheng
5623826f73 [Minor] Improve logging and rename the health check endpoint name (#1180) 2024-08-21 19:24:36 -07:00
Lianmin Zheng
bea2bb9eea Improve multi-node stability (#1171) 2024-08-20 22:35:05 -07:00
Xu-Chen
ff2cfdb1a2 [Feature] add disable-custom-all-reduce (#1148)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-20 08:44:12 -07:00
Liangsheng Yin
3694f8f996 Mixed style of chunked prefill (#1013) 2024-08-16 09:13:00 +00:00
Ying Sheng
93d4e354d8 [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) 2024-08-15 10:33:20 -07:00
Lianmin Zheng
e86b1ccbf0 Enable chunked prefill by default (#1040) 2024-08-14 21:56:20 -07:00
Ying Sheng
96a2093ef0 [Fix] Compatibility of window attention and cuda graph (#1090) 2024-08-14 10:37:01 -07:00
Ying Sheng
0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) 2024-08-13 17:01:26 -07:00
Lianmin Zheng
d84c5e70f7 Test the case when max_new_tokens is very large (#1038) 2024-08-11 16:41:03 -07:00
Lianmin Zheng
a97df79124 Clean up readme and arguments of chunked prefill (#1022) 2024-08-11 01:18:52 -07:00
gryffindor-rr
9cf0a5bada Add skip_tokenizer_init args. (#959)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-09 12:14:13 -07:00
yichuan~
ffb15744b5 Support multiple args options (#941) 2024-08-06 04:12:53 +10:00
Ying Sheng
0d4f3a9fcd Make API Key OpenAI-compatible (#917) 2024-08-04 13:35:44 -07:00
Ke Bao
e1eae1fd15 Support MLA for DeepSeek-V2 with Triton - step 1 (#905) 2024-08-05 03:40:33 +10:00
任嘉
4013a4e1b0 Implement served_model_name to customize model id when use local mode… (#749)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-01 17:13:51 -07:00
Liangsheng Yin
c020f9ceda Support chunked prefill when radix cache is disabled (#811) 2024-08-01 00:29:01 -07:00
Liangsheng Yin
6b0f2e9088 Add --max-total-tokens (#840) 2024-07-30 13:33:55 -07:00
Ying Sheng
b579ecf028 Add awq_marlin (#826) 2024-07-30 02:04:51 -07:00
Ying Sheng
e7487b08bc Adjust default mem fraction to avoid OOM (#823) 2024-07-30 01:58:31 -07:00
Liangsheng Yin
cdcbde5fc3 Code structure refactor (#807) 2024-07-29 23:04:48 -07:00
Liangsheng Yin
3520f75fb1 Remove inf value for chunked prefill size (#812) 2024-07-29 18:34:25 -07:00
yichuan~
084fa54d37 Add support for OpenAI API : offline batch(file) processing (#699)
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
2024-07-29 13:07:18 -07:00
Liangsheng Yin
7cd4f244a4 Chunked prefill (#800) 2024-07-29 03:32:58 -07:00
Ying Sheng
98111fbe3e Revert "Chunked prefill support" (#799) 2024-07-29 02:38:31 -07:00
Liangsheng Yin
2ec39ab712 Chunked prefill support (#797) 2024-07-29 02:21:50 -07:00
Yineng Zhang
dd7e8b9421 chore: add copyright for srt (#790) 2024-07-28 23:07:12 +10:00
Lianmin Zheng
752e643007 Allow disabling flashinfer sampling kernel (#778) 2024-07-27 20:18:56 -07:00
Mingyi
e4db4e5ba5 minor refactor: move check server args to server_args.py (#774) 2024-07-27 19:03:40 -07:00
Liangsheng Yin
679ebcbbdc Deepseek v2 support (#693) 2024-07-26 17:10:07 -07:00
Liangsheng Yin
268684439b Use min new token ratio at start (#701) 2024-07-23 11:52:50 -07:00
Ying Sheng
c3f1aac811 Tune params (#696) 2024-07-22 03:19:24 -07:00
Liangsheng Yin
caaad53b52 Support gpt-bigcode model class (#681) 2024-07-20 18:34:37 -07:00
Ying Sheng
06487f126e refactor model loader: initial refactor (#664) 2024-07-20 02:18:22 -07:00
Ying Sheng
51fda1439f Update Readme (#660)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-07-19 09:54:01 -07:00
zhyncs
ac971ff633 perf: reduce ttft and itl with stream_interval 1 (#658) 2024-07-19 09:14:22 -07:00
Mingyi
d774acad5c Remove the dependency of rpyc (#646) 2024-07-18 02:13:54 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625)
Co-authored-by: Mingyi <wisclmy0611@gmail.com>
2024-07-15 07:10:51 -07:00
Lianmin Zheng
665815969a Enable cuda graph by default (#612) 2024-07-13 05:29:46 -07:00
Lianmin Zheng
af4e7910e7 Clean up the usage of flashinfer (#610) 2024-07-12 13:00:03 -07:00
Liangsheng Yin
5304b4ef58 Add --enable-p2p-check option (#599) 2024-07-06 23:34:10 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Lianmin Zheng
63fbef9876 fix flashinfer & http log level 2024-07-03 23:19:33 -07:00
Lianmin Zheng
c7709d3abe Update install commands (#583) 2024-07-03 02:10:59 -07:00
Ying Sheng
9380f50ff9 Turn on flashinfer by default (#578) 2024-07-02 02:25:07 -07:00
Lianmin Zheng
badf3fa020 Expose dtype argument (#569) 2024-06-27 23:30:39 -07:00
Lianmin Zheng
2187f36237 Add a new arguments log_level_http to control the HTTP logging (#563) 2024-06-25 01:16:20 -07:00