sglang

Author	SHA1	Message	Date
Lianmin Zheng	bbc07c4197	Move sampling logits to float32 (#773)	2024-07-27 17:30:12 -07:00
Lianmin Zheng	a036d41980	Fix max new tokens (#772)	2024-07-27 17:22:18 -07:00
Lianmin Zheng	f95e661757	Fix max_tokens for OpenAI chat completion API (#766)	2024-07-27 15:44:27 -07:00
Lianmin Zheng	0736b27020	[Minor] Improve the code style in TokenizerManager (#767)	2024-07-27 05:05:15 -07:00
Ke Bao	3fdab91912	Fix TransformerTokenizer init for chatglm2 & 3 (#761)	2024-07-27 02:44:46 -07:00
Liangsheng Yin	d9fccfefe2	Fix context length (#757)	2024-07-26 18:13:13 -07:00
Liangsheng Yin	679ebcbbdc	Deepseek v2 support (#693)	2024-07-26 17:10:07 -07:00
Yineng Zhang	6b32bb1c0b	misc: format (#741)	2024-07-26 21:00:51 +10:00
Toshiki Kataoka	40facad5f1	feat: support token ids in /v1/completions (#736)	2024-07-26 02:53:17 -07:00
Toshiki Kataoka	da504445dc	fix /generate without sampling_params (#734)	2024-07-26 01:27:56 -07:00
Ying Sheng	252e0f7bbd	fix: small bug for llama-405b fp16 (#733)	2024-07-25 21:14:54 -07:00
Ying Sheng	8fbba3de3d	Fix bugs (fp8 checkpoints, triton cache manager) (#729)	2024-07-25 07:42:00 -07:00
Ying Sheng	ae0f6130cb	Revert "fix: fp8 config" (#728)	2024-07-25 07:25:33 -07:00
Liangsheng Yin	04ec6ba2ac	Fix dockerfile and triton cache manager (#720)	2024-07-25 03:04:21 -07:00
Ying Sheng	d63f13c13b	fix: fp8 config (#723)	2024-07-25 02:01:56 -07:00
Ying Sheng	30d8e130e7	Improve benchmark scripts (#717)	2024-07-24 14:44:14 -07:00
Yineng Zhang	e17deb27b5	fix: llama 3.1 405b fp8 (#714)	2024-07-24 09:37:41 -07:00
Ying Sheng	83d2b30d75	format	2024-07-24 10:53:07 +00:00
Ying Sheng	4367f4bb8d	Fix prefill size (#711)	2024-07-24 03:41:15 -07:00
Lianmin Zheng	00e4baa728	Update schedule_heuristic.py	2024-07-24 01:22:30 -07:00
Liangsheng Yin	4cd64b8ee6	Auto adjust new ratio (#708)	2024-07-23 22:06:02 -07:00
Lianmin Zheng	01d66ae2e8	Fix multi-node deadlock (#709)	2024-07-23 21:53:36 -07:00
Mingyi	a523a3c13a	Reduce hardcoded logic of kernel usage (#707)	2024-07-23 16:42:21 -07:00
Ying Sheng	444a02441a	Update vllm version to support llama3.1 (#705)	2024-07-23 13:49:34 -07:00
Liangsheng Yin	268684439b	Use min new token ratio at start (#701)	2024-07-23 11:52:50 -07:00
Ke Bao	824a77d04d	Fix hf config loading (#702)	2024-07-23 11:39:08 -07:00
Ying Sheng	cf99eab7d5	Fix flashinfer (#700)	2024-07-23 01:27:01 -07:00
Ying Sheng	c3f1aac811	Tune params (#696)	2024-07-22 03:19:24 -07:00
Ke Bao	5303c1ed22	Support Mistral-Nemo (#691)	2024-07-22 03:36:53 +10:00
Liangsheng Yin	eedc12e12e	Support Deepseek MoE Model (#689)	2024-07-21 03:09:29 -07:00
Lianmin Zheng	77e592e8e0	support non-streaming benchmark (#682)	2024-07-20 18:36:42 -07:00
Liangsheng Yin	caaad53b52	Support gpt-bigcode model class (#681)	2024-07-20 18:34:37 -07:00
Liangsheng Yin	69d19188fc	Decouple kv (#679)	2024-07-20 14:16:45 -07:00
Ke Bao	0ac94c36cb	Fallback when sampling failed (#678)	2024-07-20 10:44:54 -07:00
Liangsheng Yin	f424e76d96	Fix illegal tokens during sampling (#676)	2024-07-20 03:11:15 -07:00
Lianmin Zheng	490a1f39dd	Fix cuda graph with flashinfer (#675)	2024-07-20 02:43:55 -07:00
Ying Sheng	06487f126e	refactor model loader: initial refactor (#664)	2024-07-20 02:18:22 -07:00
Liangsheng Yin	39c57317e1	Revert "Temporary fix invalid sample results" (#673)	2024-07-20 02:06:31 -07:00
Lianmin Zheng	35759efa91	Support random dataset in bench_serving.py (#669)	2024-07-20 01:06:43 -07:00
Liangsheng Yin	8f4b1559e7	Temporary fix invalid sample results (#668)	2024-07-20 00:51:05 -07:00
Mingyi	e3046ea3a8	Update OpenAI API (#667)	2024-07-19 23:20:54 -07:00
yichuan~	49c5e0eca9	Add support for OpenAI API parallel sampling (#640)	2024-07-19 23:10:01 -07:00
Ke Bao	ec2150b294	Fix kill process util (#666)	2024-07-19 21:43:11 -07:00
Liangsheng Yin	7620cd37dd	Fix jump forward when streaming (#665)	2024-07-19 16:42:06 -07:00
Ying Sheng	e87c7fd501	Improve docs (#662)	2024-07-19 10:58:03 -07:00
Ying Sheng	51fda1439f	Update Readme (#660) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2024-07-19 09:54:01 -07:00
Ying Sheng	2d96da813e	refactor model loader [unreachable code]: initial refactor (#655)	2024-07-19 09:27:06 -07:00
zhyncs	c126a6ccba	feat: add benchmark serving (#657)	2024-07-19 09:15:21 -07:00
zhyncs	ac971ff633	perf: reduce ttft and itl with stream_interval 1 (#658)	2024-07-19 09:14:22 -07:00
Lianmin Zheng	e1792cca24	Remove cached triton launcher (#656)	2024-07-18 23:28:40 -07:00

284 Commits