sglang

Author	SHA1	Message	Date
jiahanc	eec9e471ca	[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563 ) Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>	2025-10-22 13:11:16 -07:00
Lianmin Zheng	6d535b719f	Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980 )	2025-10-22 11:50:26 -07:00
yuho	fdcb1d13c5	[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977 )	2025-10-22 11:29:55 -07:00
Hongbo Xu	d7e834d6ba	[6/n]decouple quantization implementation from vLLM dependency (#10750 )	2025-10-23 02:07:55 +08:00
Fan Yin	1d097aac87	[Fix] Remove unused import from triton_kernels_moe.py (#11967 ) Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>	2025-10-22 21:02:57 +08:00
996_icu	88568c01eb	[model] Support POINTSV15Chat (#9651 ) Co-authored-by: josephyou <josephyou@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: root <root@TENCENT64.site>	2025-10-22 16:58:17 +08:00
Hank Han	904655c5fd	[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606 ) Co-authored-by: Xun Sun <UNIDY2002@outlook.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-22 01:13:31 -07:00
Xun Sun	e028af6998	Fix mooncake dispatcher (#11908 )	2025-10-22 01:11:49 -07:00
Zhiyu	80b2b3207a	Enable native ModelOpt quantization support (3/3) (#10154 ) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>	2025-10-21 21:44:29 -07:00
Liangsheng Yin	9d61205dac	[lint] improve ruff check (#11922 ) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2025-10-22 11:32:50 +08:00
Chang Su	70f6309cd4	[router][grpc] Support `v1/responses` API (#11926 )	2025-10-21 17:41:48 -07:00
Yineng Zhang	704160017d	fix: resolve flashinfer 0.4.1 import (#11940 )	2025-10-21 17:19:57 -07:00
Yineng Zhang	c461e7714d	[Auto Sync] Update forward_batch_info.py (20251021) (#11934 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: yinghui <32845984+cicirori@users.noreply.github.com>	2025-10-21 15:52:15 -07:00
Zheng Wengang	fde2decf8b	[BugFix][Qwen3-VL]: add metadata for video in qwen3-vl (#11377 )	2025-10-21 15:36:01 -07:00
Yineng Zhang	9792b9d7e3	chore: upgrade flashinfer 0.4.1 (#11933 )	2025-10-21 14:46:31 -07:00
Baizhou Zhang	ef4a8097b8	Rename flashmla kernel options of nsa backend for better readability (#11876 )	2025-10-21 13:14:16 -07:00
Baizhou Zhang	ebff4ee648	Update sgl-kernel and remove fast hadamard depedency (#11844 )	2025-10-21 13:13:54 -07:00
Serge Panev	2b1da821b5	[NVIDIA] Add new SMs support for Spark & Thor (#11287 ) Signed-off-by: Serge Panev <spanev@nvidia.com>	2025-10-22 02:02:24 +08:00
Liangsheng Yin	97710ccd1a	Fix flush cache API for spec v2 (#11918 )	2025-10-21 23:01:16 +08:00
Kai-Hsun Chen	c61b0b294c	[quantization][MoE] fix the check for `tp_size` / `moe_ep_size` / `moe_intermediate_size` / `weight_block_size_n` (#11702 ) Signed-off-by: Kai-Hsun Chen <khchen@x.ai>	2025-10-21 21:25:28 +08:00
Vincent Zhong	e8640ee9be	[smol] [perf] Inverse perm improvement (#11482 ) Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>	2025-10-21 19:18:10 +08:00
b8zhong	d0a64c7e2c	vlm: enforce pybase64 for image and str encode/decode (#10700 )	2025-10-21 19:05:32 +08:00
Zhengke Zhou	260fe755b6	Simplify multi-tokenizer (#11295 ) Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>	2025-10-21 16:33:29 +08:00
ybyang	dbb16bedd5	Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572 ] (#11416 ) Signed-off-by: ybyang <ybyang7@iflytek.com> Co-authored-by: YorkSu <york_su@qq.com>	2025-10-21 16:27:56 +08:00
Neelabh Sinha	852c0578fd	[FEATURE] Add OpenAI-Compatible LoRA Adapter Selection (#11570 )	2025-10-21 15:44:33 +08:00
Atream	7e6191c098	init support for KTransformers Heterogeneous Computing (#11487 ) Co-authored-by: Jianwei Dong <1913953267@qq.com>	2025-10-21 00:17:02 -07:00
Gaurav Verma	6f9b66bdda	[AMD] Update wave-lang to 3.8.0 (#11878 ) Signed-off-by: xintin <gaurav.verma@amd.com>	2025-10-20 23:11:09 -07:00
Qiaolin Yu	d9a20fd28a	Use trtllm_mla decode kernel for draft extend in speculative decoding (#11664 )	2025-10-21 11:42:09 +08:00
Meng, Hengyu	b113c72e7a	Init attention backend for Intel XPU (#10656 ) Co-authored-by: guangyey <guangye.yu@intel.com> Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>	2025-10-21 11:41:28 +08:00
zhangdonghao-zdh	fb6cc7b000	Fix RotaryEmbedding for fp32 input (#11843 )	2025-10-21 10:56:48 +08:00
Xiaoyu Zhang	8374a96e49	piecewise cuda graph support qwen3-moe (#11845 )	2025-10-21 10:55:49 +08:00
Yuan Luo	74de76c685	Revise MRotaryEmbedding's forward (#11859 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: 羽癫 <yudian.zy@antgroup.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>	2025-10-21 10:38:29 +08:00
Chang Su	9c0b1eb5ad	[router][grpc] Fix wram-up random token ids for small models (#11887 )	2025-10-20 19:22:17 -07:00
Lianmin Zheng	01f14a7ad2	[code move] move pp into a separate mixin (#11838 )	2025-10-20 18:46:56 -07:00
Lianmin Zheng	43ad05907c	[Auto Sync] Update scheduler.py, server_args.py (20251020) (#11875 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Kan Wu <wukanustc@gmail.com>	2025-10-20 17:41:19 -07:00
fzyzcjy	0917c5da8c	Support mixing cutedsl and deepgemm backend (#11807 )	2025-10-21 07:38:35 +08:00
penguin_wwy	184a4df697	Replace function call with set literal (#11867 )	2025-10-21 01:39:16 +08:00
Qiaolin Yu	f7b1d8c5ab	Fix acc len and gen throughput metrics when enabling overlap-spec (#11823 ) Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>	2025-10-21 01:34:38 +08:00
Cheng Wan	bfc3b3f786	[9/N] MoE Refactor: cleanup dispatcher interfaces (#11847 )	2025-10-20 10:11:46 -07:00
Liangsheng Yin	da5bde4d16	Tiny fix main lint (#11862 )	2025-10-20 19:57:24 +08:00
DarkSharpness	276e7b3e4e	[Feature] New structural tag support (#10691 )	2025-10-20 18:25:58 +08:00
ishandhanani	296f689242	fix(server_args): handle tokenizer init conflicts (#11776 )	2025-10-20 00:27:19 -07:00
Shane A	d383e6616e	[Model] Add Olmo 3 model support (#11396 )	2025-10-19 23:59:16 -07:00
Shangming Cai	a2ba0bc3df	Tiny clean up for PD module and doc (#11747 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2025-10-20 11:52:42 +08:00
Ziming Huang	6d2d0ce285	[PD] Improve eagle acceptance rate by transferring draft model hidden states (#10801 ) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-20 11:52:18 +08:00
Yuan Luo	271d3d0d50	Support mrope triton kernel and add unit test (#11722 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>	2025-10-20 11:51:07 +08:00
ykcombat	c4e81e64fb	[Feature] Use current greenctx stream to communicate in PD-Multiplexing. (#11594 )	2025-10-20 10:58:20 +08:00
harrisonlimh	c726d44cc7	Recapture cuda graph after model weight update to resolve IMA error (#11780 )	2025-10-20 10:50:03 +08:00
huangtingwei	cae3956585	check master server for mooncake store (#10510 )	2025-10-20 09:37:09 +08:00
Liu-congo	be0058bc05	[BugFix] replace the input_to_float8 used in dsv2 (#11612 ) Signed-off-by: Liu-congo <1502632128@qq.com>	2025-10-19 19:34:13 -05:00

4107 Commits