sglang

Author	SHA1	Message	Date
Scott Lee	b6fb5d7666	Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441 )	2025-10-13 11:24:27 -07:00
Liangsheng Yin	bfadb5ea5f	Adjust overlap event loop (#11507 )	2025-10-14 00:33:19 +08:00
Liangsheng Yin	516738b096	Deprecate `global_server_args_dict` (#11528 )	2025-10-13 19:34:43 +08:00
Yi Zhang	a55cf5304a	[Feature] Support mamba radix cache v0 (#11214 ) Co-authored-by: hanming-lu <hanming@x.ai> Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: thalahors <ericalcaide1@gmail.com>	2025-10-12 20:57:15 -07:00
Cheng Wan	1bdd010291	Revert "Deprecate `global_server_args_dict`" (#11520 )	2025-10-12 17:40:40 -07:00
Liangsheng Yin	1083e7e3df	Deprecate `global_server_args_dict` (#11331 )	2025-10-13 01:20:47 +08:00
Liangsheng Yin	f49419061d	Move args from `global_config` to `environ` (#11332 )	2025-10-12 21:29:31 +08:00
Liangsheng Yin	20a6c0a63d	Beta spec-overlap for EAGLE (#11398 ) Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>	2025-10-12 11:02:22 +08:00
Glen Liu	47c606d3dc	[Feature] support regex strings as a stopping condition (#10635 )	2025-10-12 10:53:15 +08:00
ybyang	5061b8fd3e	fix stop when stream (#11462 ) Signed-off-by: ybyang <ybyang7@iflytek.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2025-10-11 22:06:31 +08:00
cctry	b36afed4a7	Separate allocation logic from scheduler (#11313 )	2025-10-10 17:38:54 -07:00
Scott Lee	55b14656e6	Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (#11433 )	2025-10-10 12:54:57 -07:00
Cheng Wan	52fcbbb8bd	Revert "perf: optimize qwen-vl with symm mem allreduce" (#11436 )	2025-10-10 12:30:05 -07:00
Yuan Luo	3b9d97f335	perf: optimize qwen-vl with symm mem allreduce (#11381 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-10 22:24:45 +08:00
Scott Lee	0babd48736	Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11144 )	2025-10-10 00:46:44 -07:00
Sundara Raman Ramachandran	53bd00d975	[Generative Score API] Multi-Item scoring with custom attention mask. (#10979 )	2025-10-08 18:47:32 -07:00
cctry	f3764c26a3	Clean match_prefix and prepare_for_extend for mem cache V2 (#11200 )	2025-10-07 17:54:18 -07:00
Liangsheng Yin	501dfa6b42	Remove sampling info events and overlap thread file (#11300 )	2025-10-07 21:34:25 +08:00
Liangsheng Yin	1519a89cfd	Remove overlap thread (#11210 ) Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>	2025-10-07 20:12:12 +08:00
Lianmin Zheng	708f4ff490	Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279 )	2025-10-06 15:50:56 -07:00
fzyzcjy	efbc687c28	Support DeepSeek V3.2 Exp (#11061 ) Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com> Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com> Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com> Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com> Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Zhengda Qin <zhengdqin@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-10-06 00:24:15 -07:00
Liangsheng Yin	4cb5a5235e	Tiny `skip_sample` adjust (#11225 )	2025-10-05 23:41:04 +08:00
Liangsheng Yin	458611de77	Unify forward output datastructure (#11124 )	2025-10-03 00:28:57 +08:00
Liangsheng Yin	25e7dbe8af	Fix ngram spec with page size > 1 (#11135 )	2025-10-02 12:34:23 +08:00
Zhang Junda	0b2aa8a70c	Introduce cpu tensor as metadata to avoid blocking gpu kernel launch (#10720 ) Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2025-10-02 10:51:25 +08:00
Lianmin Zheng	2d62af6be5	Fix metrics and request tracing (TimeStats) (#11123 )	2025-10-01 13:03:07 -07:00
Liangsheng Yin	73d4a5f879	Organize spec-related data structures (#10735 )	2025-10-01 09:45:30 +08:00
Ke Bao	91847e382a	Fix eagle radix cache (#10846 )	2025-09-30 22:59:20 +08:00
Ke Bao	424591d53d	Fix spec filter batch when target extend (#10991 )	2025-09-30 14:44:02 +08:00
narutolhy	d17986f8c6	Enable optional FP32 compute for LM Head (#10729 ) Thanks to MiniMax Team and Chenyang Zhao's support.	2025-09-29 20:45:17 -07:00
Lianmin Zheng	dda34c2f93	Fix mem fraction static for nightly tests (#11076 )	2025-09-29 12:57:41 -07:00
Zhihao Zhang	24f7cb1ece	[speculative decoding] rename lookahead to ngram (#11010 ) Co-authored-by: a4zhangfei <a4zhangfei@qq.com>	2025-09-28 21:06:59 -07:00
Shangming Cai	e23e280e16	Add support for topk metadata transferring for PD (#10616 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2025-09-28 00:09:38 +08:00
Xinyuan Tong	71f24ef8f6	feat: add cache_salt support to request (#10718 ) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>	2025-09-23 23:30:25 -07:00
Qiaolin Yu	e2ac7888b8	[2/2] Support deterministic inference for temperature > 0 (#10678 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>	2025-09-21 19:36:08 -07:00
Xinyuan Tong	12d6cf18f0	Refactors radix cache for extra key support (#10317 ) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>	2025-09-22 02:16:16 +08:00
Baizhou Zhang	8ecef73f12	[1/2] Support deterministic inference with flashinfer attention backend (#10645 ) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>	2025-09-19 23:34:29 -07:00
Zhihao Zhang	e7bc600304	[Feature] Speculative decoding support lookahead (#9873 ) Co-authored-by: a4zhangfei <a4zhangfei@qq.com> Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>	2025-09-18 16:42:41 -07:00
harrisonlimh	14fdd52740	feat: add priority based scheduling with priority based request acceptance and preemption (#8746 )	2025-09-16 17:10:10 -07:00
Yingchun Lai	b1721edbac	[PD metrics] Add latency Histogram metrics of each stage for generate requests (#8710 )	2025-09-16 01:52:49 +08:00
Cheng Wan	4844fac91d	Refactor TopK to ensure readability and extensibility (#9338 )	2025-09-14 19:16:25 -07:00
Liangsheng Yin	6897e06b69	Remove repeatedly lists adding in `init_incremental_detokenization` (#10412 )	2025-09-14 10:05:52 +08:00
Sundara Raman Ramachandran	a360511d7b	[Generative Score API] Scoring(Prefill-only) optimizations. (#9748 )	2025-09-14 01:57:06 +08:00
Yi Zhang	30c6e1f569	Qwen3-Next support (#10233 ) Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: qingquansong <ustcsqq@gmail.com> Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>	2025-09-11 04:11:49 -07:00
DarkSharpness	948b01a04c	[Refactor] Remove Hicache Load & Write threads (#10127 ) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-09-08 22:18:50 -07:00
Baizhou Zhang	8ad700f735	Cleaning codes for speculative attention mode (#10149 )	2025-09-08 17:38:06 -07:00
cicirori	8c5930f08a	Add speculator attention backend switch (#9981 )	2025-09-07 21:44:36 -07:00
Zhiqiang Xie	3b99f23c44	[Bugfix] Retract not releasing enough memory when page size > 1 (#9989 )	2025-09-07 21:41:50 -07:00
Qiaolin Yu	8cda5a622c	Standalone speculative decoding (#10090 )	2025-09-07 20:55:09 -07:00
Cheng Wan	3fa62da78c	[7/N] MoE Refactor: the implementation of new framework (#9269 )	2025-09-05 21:09:09 -07:00

346 Commits