Commit Graph

114 Commits

Author SHA1 Message Date
Zhihao Zhang
24f7cb1ece [speculative decoding] rename lookahead to ngram (#11010)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
2025-09-28 21:06:59 -07:00
Liangsheng Yin
4a762041d7 move environ into sglang.srt to avoid break SRT auto sync. (#10791) 2025-09-23 02:04:20 -07:00
Liangsheng Yin
632b7d8cc9 Use simulate acc len from sglang.environ (#10771) 2025-09-23 12:59:50 +08:00
Yineng Zhang
dab4663b4e [Auto Sync] Update .clang-format (20250919) (#10670)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-09-19 12:31:44 -07:00
Yineng Zhang
610a6d6e86 fix: resolve sync issue (#10668) 2025-09-19 12:05:39 -07:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
lijin23
05b01ef4da fix duplicated logger in eager_utils (#10410) 2025-09-13 22:25:40 -07:00
Yi Zhang
297d374510 support qwen3_next blackwell (#10403) 2025-09-13 17:18:26 +08:00
Stefan He
6c18ab46a2 [Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s (#10335)
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
2025-09-11 11:59:48 -07:00
Yi Zhang
30c6e1f569 Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
2025-09-11 04:11:49 -07:00
Baizhou Zhang
8ad700f735 Cleaning codes for speculative attention mode (#10149) 2025-09-08 17:38:06 -07:00
cicirori
8c5930f08a Add speculator attention backend switch (#9981) 2025-09-07 21:44:36 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090) 2025-09-07 20:55:09 -07:00
Lzhang-hub
37d83c6e6d Qwen2.5-VL eagle3 infer (#8801) 2025-09-07 20:44:34 -07:00
Ximingwang-09
df397a72e8 [feat] Add P/D attention select for draft model (#9755)
Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
2025-09-02 22:47:23 -07:00
narutolhy
839c93bd2d feat: add original logprobs to response (#8375)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2025-08-29 11:43:57 -07:00
zyksir
aee094e430 add support for nvidia/gpt-oss-120b-Eagle3 (#9739) 2025-08-28 00:20:20 -07:00
datdo-msft
110a65989b [MTP] Force greedy sampling on AMD (#9127) 2025-08-22 11:14:43 -07:00
Qiaolin Yu
9c0c1e30b2 Disable torch.compile for get_last_loc_large_page_size_large_top_k (#9507)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-08-22 02:05:02 -07:00
Qiaolin Yu
9ec314c6ac Support speculative decoding in the trtllm_mha attention backend (#9331)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-08-21 23:53:35 -07:00
pranavm-nvidia
64574ef8c0 Enables speculative decoding for the trtllm_mla attention backend (#9238) 2025-08-21 01:18:21 -07:00
Liangsheng Yin
eb19ccadae [bug] fix errors related to context length in SD (#9388) 2025-08-21 10:32:34 +08:00
jiapingW
e99729c9f3 Fixed the issue where eagle3 TPOT was not as good as without eagle3. (#9404) 2025-08-20 16:42:01 -07:00
Even Zhou
de2dd73831 Revert "[feature] Rework Ascend NPU graph support" (#9385) 2025-08-20 00:35:10 -07:00
Even Zhou
3680d6f88b [feature] Rework Ascend NPU graph support (#9350)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com>
Co-authored-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>
2025-08-19 20:32:27 -07:00
Even Zhou
f4fafacc5d Revert "[feature] Ascend NPU graph support (#8027)" (#9348) 2025-08-19 10:11:23 -07:00
zyksir
6a9d6ca33c fix unexpected answer in EAGLE mode (#9252) 2025-08-16 17:45:36 -07:00
VDV1985
94371dbbd6 [feature] Ascend NPU graph support (#8027)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com>
Co-authored-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>
2025-08-16 17:25:17 -07:00
Cheng Wan
b87aacb5c5 [DP Attention] Refactor: adding some utility functions (#9136) 2025-08-13 21:08:06 -07:00
valarLip
53f7874ae6 refine aiter_backend for mtp (#7279)
Co-authored-by: HAI <hixiao@gmail.com>
2025-08-08 11:06:02 -07:00
Cheng Wan
cb099d2095 [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (#8579) 2025-08-03 03:06:47 -07:00
Cheng Wan
7a1f7fc504 [Feature] Hybrid EP and TP (#8590) 2025-07-31 02:53:25 -07:00
Lianmin Zheng
ed2e313eb6 Clean up server_args, triton cache manager (#8332) 2025-07-25 14:14:51 -07:00
Cheng Wan
c0fb25e949 DP Enhancement (#8280) 2025-07-24 21:36:21 -07:00
J
0e7a5b2694 fix: prevent crashes due to logit bias dimension mismatch (#7685) 2025-07-23 15:30:55 -07:00
Jay Zhou
99aefa037e Fix eagle3 cuda graph (#8163) 2025-07-20 15:28:06 +08:00
Lianmin Zheng
5589b75024 Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (#7756)
Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>
2025-07-05 12:17:05 -07:00
Cheng Wan
6c903611ca Fix incorrect spec_num_draft_tokens in draft_extend (#7757) 2025-07-05 02:18:16 -07:00
lukec
886d344964 support llama4 eagle3 (#6985)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-06-30 22:34:10 -07:00
Lianmin Zheng
071a1f51ae [Minor] clean up multimodal processor and tokenizer manager (#7624) 2025-06-29 02:50:14 -07:00
u4lr451
ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-23 17:34:13 -07:00
Cheng Wan
f8d48fd311 Fix dtype for idle input in spec decoding (#7456) 2025-06-23 11:23:25 -07:00
Cheng Wan
ac5010e0ba Fix CUDA Graph Check under Deepep with DP FFN (#7451) 2025-06-22 20:35:58 -07:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-17 00:33:28 -07:00
Lianmin Zheng
1a9c2c9214 Fix AMD speculative decoding (#7252) 2025-06-16 17:01:33 -07:00
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
Lianmin Zheng
53a525bf33 [Eagle] Fix kernel call after updating speculative sampling kernels (#7231) 2025-06-16 07:25:59 -07:00
Lianmin Zheng
b1286a116a [EAGLE] Refactor code for page size > 1 & more simplifications (#7213) 2025-06-16 03:04:29 -07:00