4048 Commits

Author SHA1 Message Date
fzyzcjy
12eb02e982 Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 (#11805) 2025-10-19 16:15:13 +08:00
fzyzcjy
002d037359 Avoid generation gets hanging when user specifies multiple event loops (#5162)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-10-19 16:12:49 +08:00
fzyzcjy
ce399e154c Make single-batch overlap compatible with NextN (#11804) 2025-10-19 16:10:44 +08:00
fzyzcjy
ea6275dfbc Tiny add hints when users send requests to wrong place (#11808) 2025-10-19 16:10:20 +08:00
narutolhy
eb7318f1c2 support tokenized batch request (#11091) 2025-10-19 07:05:02 +00:00
YAMY
80407b0493 Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788) 2025-10-19 11:37:43 +08:00
Liangsheng Yin
b288f4f440 Improve send_sone script (#11817) 2025-10-19 11:28:16 +08:00
tazjin
6d6ea5af0c fix: do not wrap invalid grammar objects during constrained generation (#11328) 2025-10-19 10:54:33 +08:00
Marin
1dacedd2db make sure logit bias is applied during eagle spec decoding verification (#11555) 2025-10-19 10:53:33 +08:00
ybyang
b5e14b2b78 [1/2][feature] support openai like classification api (#11618) 2025-10-18 19:32:48 -07:00
Qiaolin Yu
ebda73dc72 Use cutlass fp4 gemm by default (#11813) 2025-10-18 14:10:15 -07:00
b8zhong
f9a7d9b3dc support server arg override KV cache to bf16 to avoid slow cases (#11749) 2025-10-19 02:49:48 +08:00
Liangsheng Yin
a93f10a722 [overlap-spec] support page size > 1 (#11772) 2025-10-19 02:09:13 +08:00
Teng Ma
585e1223f0 [HiCache] feat: add more eviction policy (#11506) 2025-10-18 15:49:45 +00:00
fzyzcjy
a7043c6f0d Bump torch_memory_saver to avoid installing pre-release versions (#11797) 2025-10-18 01:20:42 -07:00
Lianmin Zheng
67e34c56d7 Fix install instructions and pyproject.tomls (#11781) 2025-10-18 01:08:01 -07:00
Yuwei An
1d726528f7 Eager Compiler for Torch Compile (#11803)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
2025-10-18 15:18:52 +08:00
Minglei Zhu
f4488e9dd9 set default attention backend for deterministic inference (#11801) 2025-10-18 00:01:24 -07:00
Zilin Zhu
e68a2b5b2f [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded (#10152) 2025-10-18 14:29:35 +08:00
Zilin Zhu
31b9f19e54 [RL] support weight update with DP attention (#11669) 2025-10-18 14:26:19 +08:00
Jimmy
f7ab955455 fix(glm45): disable reduce scatter (#11665)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-10-18 12:19:20 +08:00
Chang Su
ca240eefb4 [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) 2025-10-17 20:49:43 -07:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784) 2025-10-17 18:57:54 -07:00
Minglei Zhu
13219e1e48 completely remove mixed mode deterministic test as prefix mode could cover it (#11783)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-17 17:46:03 -07:00
fzyzcjy
33e9bbec35 Make single-batch overlap compatible with offloading (#11614) 2025-10-18 08:45:54 +08:00
fzyzcjy
dcb8f090ad Super tiny fix CI (#11788) 2025-10-17 17:41:58 -07:00
Lianmin Zheng
9eefe2c0b7 Set CUDA_VISIBLE_DEVICES to achieve one GPU per process (#9170)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Cheng Wan <cwan@x.ai>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-10-17 17:30:06 -07:00
Zilin Zhu
69fe3c9726 Manually flip deepep_mode for cuda_graph (#11666) 2025-10-18 08:05:48 +08:00
fzyzcjy
8af8491298 Support casting bf16 NextN moe to fp8 (#11613) 2025-10-18 08:02:15 +08:00
fzyzcjy
505329cab0 Support shared experts overlap in cutlass moe (#11611) 2025-10-18 07:59:40 +08:00
fzyzcjy
8a382fd399 Super tiny fix missing input throughput (#11607) 2025-10-18 07:58:48 +08:00
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
Antonin Vidon
2614adf9ca [Fix] Skip visual layers when applying LoRA to Qwen2VL modules (#11519) 2025-10-17 17:39:57 -05:00
Lianmin Zheng
fdd7c69d65 [Auto Sync] Update common.py (20251017) (#11782)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-10-17 15:03:42 -07:00
Lianmin Zheng
b9a54e0968 [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests (#11777)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
2025-10-17 14:25:22 -07:00
Baizhou Zhang
20b8d2306c Cleaning indexer for DeepSeek V3.2 (#11682) 2025-10-17 13:47:21 -07:00
Yineng Zhang
b79f75fd53 [Auto Sync] Update scheduler.py (20251017) (#11738) 2025-10-17 12:36:07 -07:00
Chunyuan WU
8fcc69e7c4 Turn on shm_allreduce and shm_allgather for fp16 (#10725) 2025-10-17 12:35:20 -07:00
ykcombat
f440baa136 [Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540) 2025-10-18 02:35:06 +08:00
Yineng Zhang
da681f35d3 Revert "Set csgmv as default lora backend. (#11488)" (#11735) 2025-10-17 12:01:36 -05:00
pdasgup
9b0f725b1d add tuned fuse moe kernel for qwen3 235b fp8 on h200 (#11730)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-17 09:55:09 -07:00
Liangsheng Yin
cde5a6e30f Abstraction for spec worker and code cleanup (#11643) 2025-10-17 23:31:36 +08:00
Mick
3e4c7da2f5 ci: reduce and refactor vlm ut and combine test files (#11062) 2025-10-17 15:24:50 +00:00
Liangsheng Yin
d88ac9bc9a [overlap-spec] Make plan stream an option (#11724) 2025-10-17 15:48:57 +08:00
Liangsheng Yin
ce11dd82dc [CI] Try fix broken event loop init (#11746) 2025-10-17 13:30:17 +08:00
StonyPort
fd389df96e Reduce the image processing latency in VLM (#11541)
Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com>
2025-10-16 15:00:03 -07:00
Baizhou Zhang
b0d1d717e1 Revert "make radix cache deterministic" (#11728) 2025-10-16 14:36:15 -07:00
Simo Lin
4f24ab1718 [router][grpc] add dissag info to warm up in grpc server (#11727) 2025-10-16 14:19:55 -07:00
Mick
86b04d25b3 model: qwen3-omni (thinker-only) (#10911)
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-10-16 13:20:38 -07:00
sglang-bot
85ebeecf06 chore: bump SGLang version to 0.5.3.post3 (#11693)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2025-10-16 13:14:55 -07:00