sglang

Author	SHA1	Message	Date
Paiiii	efa473348b	[Spec Decoding] Support MTP for dsv3.2 (#11652) Co-authored-by: Paiiiiiiiiiiiiii <zengpai@baidu.com>	2025-10-19 23:44:22 +08:00
Liangsheng Yin	d658f0497e	[overlap-spec] fix stop condition and trimming (#11819)	2025-10-19 22:00:20 +08:00
Liangsheng Yin	57e25de756	Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" (#11827)	2025-10-19 19:44:06 +08:00
fzyzcjy	12eb02e982	Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 (#11805)	2025-10-19 16:15:13 +08:00
fzyzcjy	002d037359	Avoid generation gets hanging when user specifies multiple event loops (#5162) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2025-10-19 16:12:49 +08:00
fzyzcjy	ce399e154c	Make single-batch overlap compatible with NextN (#11804)	2025-10-19 16:10:44 +08:00
fzyzcjy	ea6275dfbc	Tiny add hints when users send requests to wrong place (#11808)	2025-10-19 16:10:20 +08:00
narutolhy	eb7318f1c2	support tokenized batch request (#11091)	2025-10-19 07:05:02 +00:00
YAMY	80407b0493	Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788)	2025-10-19 11:37:43 +08:00
Liangsheng Yin	b288f4f440	Improve `send_sone` script (#11817)	2025-10-19 11:28:16 +08:00
tazjin	6d6ea5af0c	fix: do not wrap invalid grammar objects during constrained generation (#11328)	2025-10-19 10:54:33 +08:00
Marin	1dacedd2db	make sure logit bias is applied during eagle spec decoding verification (#11555)	2025-10-19 10:53:33 +08:00
ybyang	b5e14b2b78	[1/2][feature] support openai like classification api (#11618)	2025-10-18 19:32:48 -07:00
Qiaolin Yu	ebda73dc72	Use cutlass fp4 gemm by default (#11813)	2025-10-18 14:10:15 -07:00
b8zhong	f9a7d9b3dc	support server arg override KV cache to bf16 to avoid slow cases (#11749)	2025-10-19 02:49:48 +08:00
Liangsheng Yin	a93f10a722	[overlap-spec] support page size > 1 (#11772)	2025-10-19 02:09:13 +08:00
Teng Ma	585e1223f0	[HiCache] feat: add more eviction policy (#11506)	2025-10-18 15:49:45 +00:00
fzyzcjy	a7043c6f0d	Bump torch_memory_saver to avoid installing pre-release versions (#11797)	2025-10-18 01:20:42 -07:00
Lianmin Zheng	67e34c56d7	Fix install instructions and pyproject.tomls (#11781)	2025-10-18 01:08:01 -07:00
Yuwei An	1d726528f7	Eager Compiler for Torch Compile (#11803) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>	2025-10-18 15:18:52 +08:00
Minglei Zhu	f4488e9dd9	set default attention backend for deterministic inference (#11801)	2025-10-18 00:01:24 -07:00
Zilin Zhu	e68a2b5b2f	[RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded (#10152)	2025-10-18 14:29:35 +08:00
Zilin Zhu	31b9f19e54	[RL] support weight update with DP attention (#11669)	2025-10-18 14:26:19 +08:00
Jimmy	f7ab955455	fix(glm45): disable reduce scatter (#11665) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-18 12:19:20 +08:00
Chang Su	ca240eefb4	[router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798)	2025-10-17 20:49:43 -07:00
Cheng Wan	5b214b50b6	[Refactor] move `deep_gemm_wrapper` out of `quantization` (#11784)	2025-10-17 18:57:54 -07:00
Minglei Zhu	13219e1e48	completely remove mixed mode deterministic test as prefix mode could cover it (#11783) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-10-17 17:46:03 -07:00
fzyzcjy	33e9bbec35	Make single-batch overlap compatible with offloading (#11614)	2025-10-18 08:45:54 +08:00
fzyzcjy	dcb8f090ad	Super tiny fix CI (#11788)	2025-10-17 17:41:58 -07:00
Lianmin Zheng	9eefe2c0b7	Set CUDA_VISIBLE_DEVICES to achieve one GPU per process (#9170) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Cheng Wan <cwan@x.ai> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>	2025-10-17 17:30:06 -07:00
Zilin Zhu	69fe3c9726	Manually flip deepep_mode for cuda_graph (#11666)	2025-10-18 08:05:48 +08:00
fzyzcjy	8af8491298	Support casting bf16 NextN moe to fp8 (#11613)	2025-10-18 08:02:15 +08:00
fzyzcjy	505329cab0	Support shared experts overlap in cutlass moe (#11611)	2025-10-18 07:59:40 +08:00
fzyzcjy	8a382fd399	Super tiny fix missing input throughput (#11607)	2025-10-18 07:58:48 +08:00
Chang Su	627974405d	[Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files (#11685)	2025-10-17 16:49:46 -07:00
Antonin Vidon	2614adf9ca	[Fix] Skip visual layers when applying LoRA to Qwen2VL modules (#11519)	2025-10-17 17:39:57 -05:00
Lianmin Zheng	fdd7c69d65	[Auto Sync] Update common.py (20251017) (#11782) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>	2025-10-17 15:03:42 -07:00
Lianmin Zheng	b9a54e0968	[minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests (#11777) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>	2025-10-17 14:25:22 -07:00
Baizhou Zhang	20b8d2306c	Cleaning indexer for DeepSeek V3.2 (#11682)	2025-10-17 13:47:21 -07:00
Yineng Zhang	b79f75fd53	[Auto Sync] Update scheduler.py (20251017) (#11738)	2025-10-17 12:36:07 -07:00
Chunyuan WU	8fcc69e7c4	Turn on shm_allreduce and shm_allgather for fp16 (#10725)	2025-10-17 12:35:20 -07:00
ykcombat	f440baa136	[Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540)	2025-10-18 02:35:06 +08:00
Yineng Zhang	da681f35d3	Revert "Set csgmv as default lora backend. (#11488)" (#11735)	2025-10-17 12:01:36 -05:00
pdasgup	9b0f725b1d	add tuned fuse moe kernel for qwen3 235b fp8 on h200 (#11730) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2025-10-17 09:55:09 -07:00
Liangsheng Yin	cde5a6e30f	Abstraction for spec worker and code cleanup (#11643)	2025-10-17 23:31:36 +08:00
Mick	3e4c7da2f5	ci: reduce and refactor vlm ut and combine test files (#11062)	2025-10-17 15:24:50 +00:00
Liangsheng Yin	d88ac9bc9a	[overlap-spec] Make plan stream an option (#11724)	2025-10-17 15:48:57 +08:00
Liangsheng Yin	ce11dd82dc	[CI] Try fix broken event loop init (#11746)	2025-10-17 13:30:17 +08:00
StonyPort	fd389df96e	Reduce the image processing latency in VLM (#11541) Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com>	2025-10-16 15:00:03 -07:00
Baizhou Zhang	b0d1d717e1	Revert "make radix cache deterministic" (#11728)	2025-10-16 14:36:15 -07:00

4051 Commits