sglang

Author	SHA1	Message	Date
Lianmin Zheng	46d4431889	Add a new api configure_logging to allow dumping the requests (#2875)	2025-01-13 14:24:00 -08:00
fzyzcjy	923f518337	CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)	2025-01-13 11:38:51 -08:00
Xiaoyu Zhang	d08c77c434	Sampling penalties memory interface (#2870)	2025-01-13 23:09:00 +08:00
Lianmin Zheng	c1e097ca66	Revert "Dump requests to a folder" (#2869)	2025-01-13 06:21:25 -08:00
Lzhang-hub	6ec75e626d	add qwen2 eagle model (#2863)	2025-01-13 05:29:33 -08:00
Yineng Zhang	d855653bd4	minor: fix release docs (#2868)	2025-01-13 21:18:39 +08:00
Lianmin Zheng	336ff5b9f5	Fix typos in io_struct.py (#2867)	2025-01-13 05:13:02 -08:00
Lianmin Zheng	3b141e1509	Dump requests (#2862)	2025-01-13 04:51:56 -08:00
Lianmin Zheng	6249e4a19e	Revert "Integration of TurboMind AWQ" (#2866)	2025-01-13 04:44:39 -08:00
Ke Bao	f3516c2894	Fix quant kernel accuracy issue (#2865)	2025-01-13 20:32:17 +08:00
bjmsong	17de02f98d	Integration of TurboMind AWQ (#2828) Co-authored-by: root <bjmsong@126.com>	2025-01-13 20:14:16 +08:00
Lianmin Zheng	51ab3ccf47	Collect more metrics: num_requests_total (#2859)	2025-01-13 03:57:39 -08:00
Lianmin Zheng	67008f4b32	Use only one GPU for MLA CI tests (#2858)	2025-01-13 03:55:33 -08:00
Yineng Zhang	4536d72446	minor: use ubuntu-latest instead of self-hosted runner for amd build (#2861)	2025-01-13 18:58:56 +08:00
Yineng Zhang	41d7e5b7e6	docs: update link (#2857)	2025-01-13 18:40:48 +08:00
Yineng Zhang	20a9f5dfe0	fix: not delete CNAME (#2860)	2025-01-13 18:36:40 +08:00
kk	42f3909963	Unify sglang coding style (#2856) Co-authored-by: Lin, Soga <soga.lin@amd.com>	2025-01-13 02:12:44 -08:00
Lianmin Zheng	72c7776355	Fix linear.py and improve weight loading (#2851) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2025-01-13 01:39:14 -08:00
justdoit	4093aa4660	[Fix]eagle2 health_generate is first request,apiserver will core (#2853)	2025-01-13 01:01:21 -08:00
kk	e808c1df3e	Integrate ROCm ater package for ck moe function feasibility (#2854) Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: Lin, Soga <soga.lin@amd.com>	2025-01-13 08:23:07 +00:00
sogalin	a18ab81ddd	Update base image for ROCm (#2852) Co-authored-by: HAI <hixiao@gmail.com>	2025-01-13 14:39:44 +08:00
bjmsong	0bb0f76311	Support FP8 E4M3 KV Cache (#2786) Co-authored-by: root <bjmsong@126.com>	2025-01-12 21:17:11 -08:00
Ke Bao	85b2e05770	Add int8 quant kernel (#2848)	2025-01-13 13:16:58 +08:00
Yineng Zhang	a879c2fb4c	fix sgl-kernel build (#2850)	2025-01-13 12:27:17 +08:00
Xiaoyu Zhang	e2b16c4716	add sampling_scaling_penalties kernel (#2846)	2025-01-12 19:38:17 -08:00
Shi Shuai	c4f9707e16	Improve: Token-In Token-Out Usage for RLHF (#2843)	2025-01-11 15:14:26 -08:00
Yineng Zhang	197cbf9bab	docs: update README (#2841)	2025-01-11 23:11:38 +08:00
Yineng Zhang	f624901cdd	chore: bump v0.4.1.post5 (#2840)	2025-01-11 23:10:02 +08:00
Xiaoyu Zhang	f0e15dc6ab	[HotFix] fix fp8 scale load failed in tp>1 (#2837)	2025-01-11 14:34:26 +08:00
Lianmin Zheng	f1769586d6	Update threshold in test_nightly_gsm8k_eval.py (#2836)	2025-01-10 20:37:34 -08:00
Zhiqiang Xie	5d6e9467d4	Cache controller for hierarchical caching (#2804)	2025-01-10 20:22:01 -08:00
justdoit	a47bf39123	[Eagle2] Fix multiple concurrent request crashes (#2730)	2025-01-10 14:00:43 -08:00
TianYu GUO	b170646991	Fix port number overflow (#2826)	2025-01-10 13:44:32 -08:00
Muqi Li	5413ec2bbe	[Bugfix] Fix bug in fork logic caused by null text_ (#2835)	2025-01-10 13:37:00 -08:00
Chang Su	f290bd4332	[Bugfix] Fix embedding model hangs with `--enable-metrics` (#2822)	2025-01-10 13:14:51 -08:00
Pratyush Patel	8f15789314	Add more metrics to serving benchmark. (#2819)	2025-01-10 23:30:44 +08:00
Lianmin Zheng	2db03a04ca	Update README.md (#2833) Co-authored-by: Heiner <heiner@x.ai>	2025-01-10 03:49:04 -08:00
Chayenne	5cc1170552	Doc: add block-wise FP8 in dpsk model reference (#2830)	2025-01-10 00:26:59 -08:00
Xiaotong Jiang	11fffbc95a	[Doc]: Deepseek reference docs (#2787)	2025-01-09 13:43:12 -08:00
sleepcoo	4f077c01b8	minor: support specifying local dataset path for gsm8k and hellaswag (#2816)	2025-01-09 22:24:42 +08:00
Lianmin Zheng	679c3bcacf	Fix typo in cuda_graph_bs (#2813)	2025-01-09 03:03:24 -08:00
Yunmeng	656aed58c6	Remove vllm dependency in model config (#2809)	2025-01-09 17:51:56 +08:00
Ke Bao	b5fb4ef58a	Update modelopt config and fix running issue (#2792)	2025-01-08 18:04:30 +08:00
Chayenne	2e6346fc2e	Docs：Update the style of llma 3.1 405B docs (#2789)	2025-01-08 01:07:54 -08:00
mlmz	977f785dad	Docs: Rewrite docs for LLama 405B and ModelSpace (#2773) Co-authored-by: Chayenne <zhaochen20@outlook.com>	2025-01-08 00:02:59 -08:00
Lianmin Zheng	8a6906127a	Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784) Co-authored-by: SangBin Cho rkooo567@gmail.com	2025-01-07 23:29:10 -08:00
JJJJOHNSON	694e41925e	[eagle2] fix end check when target model verify (#2723)	2025-01-07 21:46:02 -08:00
Lianmin Zheng	b22f3f6475	Fix nightly accuracy tests (#2780)	2025-01-07 21:02:35 -08:00
Lianmin Zheng	6fb5768372	Disable math eval on nightly CI temporarily (#2779)	2025-01-07 18:17:34 -08:00
Zhiqiang Xie	51caee740f	Host memory pool for hierarchical caching (#2771)	2025-01-07 21:38:37 +00:00

1708 Commits