sglang

Author	SHA1	Message	Date
narutolhy	839c93bd2d	feat: add original logprobs to response (#8375) Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>	2025-08-29 11:43:57 -07:00
zyksir	aee094e430	add support for nvidia/gpt-oss-120b-Eagle3 (#9739)	2025-08-28 00:20:20 -07:00
Qiaolin Yu	9c0c1e30b2	Disable torch.compile for get_last_loc_large_page_size_large_top_k (#9507) Co-authored-by: ispobock <ispobaoke@gmail.com>	2025-08-22 02:05:02 -07:00
Qiaolin Yu	9ec314c6ac	Support speculative decoding in the trtllm_mha attention backend (#9331) Co-authored-by: ispobock <ispobaoke@gmail.com>	2025-08-21 23:53:35 -07:00
pranavm-nvidia	64574ef8c0	Enables speculative decoding for the trtllm_mla attention backend (#9238)	2025-08-21 01:18:21 -07:00
Liangsheng Yin	eb19ccadae	[bug] fix errors related to context length in SD (#9388)	2025-08-21 10:32:34 +08:00
zyksir	6a9d6ca33c	fix unexcepted answer in EAGLE mode (#9252)	2025-08-16 17:45:36 -07:00
valarLip	53f7874ae6	refine aiter_backend for mtp (#7279) Co-authored-by: HAI <hixiao@gmail.com>	2025-08-08 11:06:02 -07:00
Cheng Wan	7a1f7fc504	[Feature] Hybrid EP and TP (#8590)	2025-07-31 02:53:25 -07:00
Cheng Wan	c0fb25e949	DP Enhancement (#8280)	2025-07-24 21:36:21 -07:00
Cheng Wan	6c903611ca	Fix incorrect spec_num_draft_tokens in draft_extend (#7757)	2025-07-05 02:18:16 -07:00
lukec	886d344964	support llama4 eagle3 (#6985) Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: yizhang2077 <1109276519@qq.com>	2025-06-30 22:34:10 -07:00
u4lr451	ed0a0b692c	Perormance: Enable cuda graph for dp idle batch (#7269) Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-06-23 17:34:13 -07:00
Cheng Wan	f8d48fd311	Fix dtype for idle input in spec decoding (#7456)	2025-06-23 11:23:25 -07:00
u4lr451	10d60cd41b	feat: mtp support dp-attention (#6081) Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-06-17 00:33:28 -07:00
Lianmin Zheng	c64290dcb5	Use seq_len_fill_value in the cuda graph runners (#7233)	2025-06-16 15:57:07 -07:00
Lianmin Zheng	53a525bf33	[Eagle] Fix kernel call after updating speculative sampling kernels (#7231)	2025-06-16 07:25:59 -07:00
Lianmin Zheng	b1286a116a	[EAGLE] Refactor code for page size > 1 & more simplifications (#7213)	2025-06-16 03:04:29 -07:00
Lianmin Zheng	fff10809bf	Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210)	2025-06-15 02:48:00 -07:00
Lianmin Zheng	5f1ab32717	[EAGLE] Refactor code for page size > 1 & more simplifications (#7163)	2025-06-14 23:16:23 -07:00
kyle-pena-kuzco	b56de8f943	Open AI API hidden states (#6716)	2025-06-10 14:37:29 -07:00
Lianmin Zheng	dc0705a504	Simplify prepare_extend_after_decode (#6987)	2025-06-09 16:39:21 -07:00
Lianmin Zheng	0c1f03a23d	Sync cuda graph runners (#6976)	2025-06-08 16:12:25 -07:00
Ke Bao	a2cb5913a0	Add draft extend CUDA graph for flashinfer backend (#6805)	2025-06-02 01:51:26 -07:00
Ke Bao	7e41290082	Add draft extend CUDA graph for Triton backend (#6705)	2025-05-29 00:13:07 -07:00
Ke Bao	631950280a	Support EAGLE draft extend CUDA graph (#6606) Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>	2025-05-27 02:35:17 -07:00
Ke Bao	6ce0ed073b	Apply constraint grammar to EAGLE (#6499) Co-authored-by: merrymercy <lianminzheng@gmail.com>	2025-05-21 17:18:41 -07:00
Lifu Huang	3cf1473a09	Use monotonic clock for interval measurement (#6211) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>	2025-05-17 16:49:18 -07:00
quinnrong94	2e4babdb0a	[Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109) Co-authored-by: Yingyi <yingyihuang2000@outlook.com> Co-authored-by: neiltian <neiltian@tencent.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: kexueyu <kexueyu@tencent.com> Co-authored-by: vincentmeng <vincentmeng@tencent.com> Co-authored-by: pengmeng <pengmeng@tencent.com>	2025-05-15 00:48:09 -07:00
Lianmin Zheng	e8e18dcdcc	Revert "fix some typos" (#6244)	2025-05-12 12:53:26 -07:00
applesaucethebun	d738ab52f8	fix some typos (#6209) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2025-05-13 01:42:38 +08:00
Lianmin Zheng	fba8eccd7e	Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2025-05-12 00:17:33 -07:00
applesaucethebun	2ce8793519	Add typo checker in pre-commit (#6179) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2025-05-11 12:55:00 +08:00
Ying Sheng	11383cec3c	[PP] Add pipeline parallelism (#5724)	2025-04-30 18:18:07 -07:00
JieXin Liang	97cb762bb6	[misc] remove is_cuda_available (#5319)	2025-04-20 18:16:51 -07:00
u4lr451	211c7b31b8	Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511)	2025-04-17 18:49:59 -07:00
fzyzcjy	86a876d883	Optimize topk operation in llama4 (#5128)	2025-04-09 02:50:22 -07:00
Baizhou Zhang	efbae697b3	[Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052)	2025-04-05 01:23:02 -07:00
Qingquan Song	e983e43248	Add Eagle Speculative Decoding to FA3 Backend (#4951) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com>	2025-04-02 13:09:02 -07:00
Lianmin Zheng	b26bc86b36	Support page size > 1 + eagle (#4908)	2025-03-30 00:46:23 -07:00
Brayden Zhong	e84f4ba0ab	[Misc] Fix issues reported by torchfix (#4837)	2025-03-27 20:10:32 -07:00
James Liu	9e0186f352	[Feature] Support EAGLE 3 (#4247)	2025-03-18 07:35:23 -07:00
Ying Sheng	1b859295f4	[Eagle] Remove the greedy branch and some redundant code (#4363) Co-authored-by: Sehoon Kim <sehoon@x.ai>	2025-03-16 02:48:55 -07:00
Baizhou Zhang	9fb48f951f	Support nextn for flashinfer mla attention backend (#4218)	2025-03-09 00:01:54 -08:00
Lianmin Zheng	d4017a6b63	[EAGLE] many fixes for eagle (#4195) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Sehoon Kim <sehoon@x.ai>	2025-03-07 22:12:13 -08:00
Lianmin Zheng	bc1534ff32	Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134) Co-authored-by: Sehoon Kim <kssteven418@gmail.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Sehoon Kim <sehoon@x.ai>	2025-03-06 06:13:59 -08:00
Ying Sheng	d3d4d76758	[Eagle] Refactor eagle speculative decoding (#3986) Co-authored-by: Ke Bao <ISPObaoke@163.com>	2025-03-05 08:06:07 -08:00
Lianmin Zheng	77a3954bf7	Simplify eagle tests and TP sync in grammar backend (#4066)	2025-03-04 13:40:40 -08:00
William	0d4e3228cf	[Feature] Add test for speculative_token_map (#4016)	2025-03-04 04:26:24 -08:00
Ke Bao	9fafa62db7	Share target model embed and head weights for nextn (#4033)	2025-03-03 13:30:04 -08:00

62 Commits