sglang

Author	SHA1	Message	Date
Ke Bao	2c615d120f	[Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-25 17:38:11 -07:00
Lianmin Zheng	902278008a	[Minor] Improve the function organization in TokenizerManager & improve loggers (#1208)	2024-08-25 14:46:34 -07:00
Chayenne	30b4f771b0	Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) Co-authored-by: Ying Sheng <sqy1415@gmail.com>	2024-08-25 10:29:12 -07:00
Lianmin Zheng	f6af3a6561	Cleanup readme, llava examples, usage examples and nccl init (#1194)	2024-08-24 08:02:23 -07:00
Lianmin Zheng	5623826f73	[Minor] Improve logging and rename the health check endpoint name (#1180)	2024-08-21 19:24:36 -07:00
Lianmin Zheng	bea2bb9eea	Improve multi-node stability (#1171)	2024-08-20 22:35:05 -07:00
Xu-Chen	ff2cfdb1a2	[Feature] add disable-custom-all-reduce (#1148) Co-authored-by: chenxu02 <chenxu02@zhihu.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-20 08:44:12 -07:00
Liangsheng Yin	3694f8f996	Mixed style of chunked prefill (#1013)	2024-08-16 09:13:00 +00:00
Ying Sheng	93d4e354d8	[Fix] Window attention compatible with RadixAttention and chunked prefill (#1112)	2024-08-15 10:33:20 -07:00
Lianmin Zheng	e86b1ccbf0	Enable chunked prefill by default (#1040)	2024-08-14 21:56:20 -07:00
Ying Sheng	96a2093ef0	[Fix] Compatibility of window attention and cuda graph (#1090)	2024-08-14 10:37:01 -07:00
Ying Sheng	0909bb0d2f	[Feat] Add window attention for gemma-2 (#1056)	2024-08-13 17:01:26 -07:00
Lianmin Zheng	d84c5e70f7	Test the case when max_new_tokens is very large (#1038)	2024-08-11 16:41:03 -07:00
Lianmin Zheng	a97df79124	Clean up readme and arguments of chunked prefill (#1022)	2024-08-11 01:18:52 -07:00
gryffindor-rr	9cf0a5bada	Add skip_tokenizer_init args. (#959) Co-authored-by: lzhang <zhanglei@modelbest.cn>	2024-08-09 12:14:13 -07:00
yichuan~	ffb15744b5	Support multiple args options (#941)	2024-08-06 04:12:53 +10:00
Ying Sheng	0d4f3a9fcd	Make API Key OpenAI-compatible (#917)	2024-08-04 13:35:44 -07:00
Ke Bao	e1eae1fd15	Support MLA for DeepSeek-V2 with Triton - step 1 (#905)	2024-08-05 03:40:33 +10:00
任嘉	4013a4e1b0	Implement served_model_name to customize model id when use local mode… (#749) Co-authored-by: Ying Sheng <sqy1415@gmail.com>	2024-08-01 17:13:51 -07:00
Liangsheng Yin	c020f9ceda	Support chunked prefill when radix cache is disabled (#811)	2024-08-01 00:29:01 -07:00
Liangsheng Yin	6b0f2e9088	Add `--max-total-tokens` (#840)	2024-07-30 13:33:55 -07:00
Ying Sheng	b579ecf028	Add awq_marlin (#826)	2024-07-30 02:04:51 -07:00
Ying Sheng	e7487b08bc	Adjust default mem fraction to avoid OOM (#823)	2024-07-30 01:58:31 -07:00
Liangsheng Yin	cdcbde5fc3	Code structure refactor (#807)	2024-07-29 23:04:48 -07:00
Liangsheng Yin	3520f75fb1	Remove inf value for chunked prefill size (#812)	2024-07-29 18:34:25 -07:00
yichuan~	084fa54d37	Add support for OpenAI API : offline batch(file) processing (#699) Co-authored-by: hnyls2002 <hnyls2002@gmail.com>	2024-07-29 13:07:18 -07:00
Liangsheng Yin	7cd4f244a4	Chunked prefill (#800)	2024-07-29 03:32:58 -07:00
Ying Sheng	98111fbe3e	Revert "Chunked prefill support" (#799)	2024-07-29 02:38:31 -07:00
Liangsheng Yin	2ec39ab712	Chunked prefill support (#797)	2024-07-29 02:21:50 -07:00
Yineng Zhang	dd7e8b9421	chore: add copyright for srt (#790)	2024-07-28 23:07:12 +10:00
Lianmin Zheng	752e643007	Allow disabling flashinfer sampling kernel (#778)	2024-07-27 20:18:56 -07:00
Mingyi	e4db4e5ba5	minor refactor: move check server args to server_args.py (#774)	2024-07-27 19:03:40 -07:00
Liangsheng Yin	679ebcbbdc	Deepseek v2 support (#693)	2024-07-26 17:10:07 -07:00
Liangsheng Yin	268684439b	Use min new token ratio at start (#701)	2024-07-23 11:52:50 -07:00
Ying Sheng	c3f1aac811	Tune params (#696)	2024-07-22 03:19:24 -07:00
Liangsheng Yin	caaad53b52	Support gpt-bigcode model class (#681)	2024-07-20 18:34:37 -07:00
Ying Sheng	06487f126e	refactor model loader: initial refactor (#664)	2024-07-20 02:18:22 -07:00
Ying Sheng	51fda1439f	Update Readme (#660) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2024-07-19 09:54:01 -07:00
zhyncs	ac971ff633	perf: reduce ttft and itl with stream_interval 1 (#658)	2024-07-19 09:14:22 -07:00
Mingyi	d774acad5c	Remove the dependency of rpyc (#646)	2024-07-18 02:13:54 -07:00
Ying Sheng	6a2941f4d0	Improve tensor parallel performance (#625) Co-authored-by: Mingyi <wisclmy0611@gmail.com>	2024-07-15 07:10:51 -07:00
Lianmin Zheng	665815969a	Enable cuda graph by default (#612)	2024-07-13 05:29:46 -07:00
Lianmin Zheng	af4e7910e7	Clean up the usage of flashinfer (#610)	2024-07-12 13:00:03 -07:00
Liangsheng Yin	5304b4ef58	Add `--enable-p2p-check` option (#599)	2024-07-06 23:34:10 -07:00
Ying Sheng	dc1b8bcfaa	Format (#593)	2024-07-05 10:06:17 -07:00
Lianmin Zheng	63fbef9876	fix flashinfer & http log level	2024-07-03 23:19:33 -07:00
Lianmin Zheng	c7709d3abe	Update install commands (#583)	2024-07-03 02:10:59 -07:00
Ying Sheng	9380f50ff9	Turn on flashinfer by default (#578)	2024-07-02 02:25:07 -07:00
Lianmin Zheng	badf3fa020	Expose dtype argument (#569)	2024-06-27 23:30:39 -07:00
Lianmin Zheng	2187f36237	Add a new arguments log_level_http to control the HTTP logging (#563)	2024-06-25 01:16:20 -07:00

85 Commits