Commit Graph

4030 Commits

Author SHA1 Message Date
Zilin Zhu
e68a2b5b2f [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded (#10152) 2025-10-18 14:29:35 +08:00
Zilin Zhu
31b9f19e54 [RL] support weight update with DP attention (#11669) 2025-10-18 14:26:19 +08:00
Jimmy
f7ab955455 fix(glm45): disable reduce scatter (#11665)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-10-18 12:19:20 +08:00
Chang Su
ca240eefb4 [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) 2025-10-17 20:49:43 -07:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784) 2025-10-17 18:57:54 -07:00
Minglei Zhu
13219e1e48 completely remove mixed mode deterministic test as prefix mode could cover it (#11783)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-17 17:46:03 -07:00
fzyzcjy
33e9bbec35 Make single-batch overlap compatible with offloading (#11614) 2025-10-18 08:45:54 +08:00
fzyzcjy
dcb8f090ad Super tiny fix CI (#11788) 2025-10-17 17:41:58 -07:00
Lianmin Zheng
9eefe2c0b7 Set CUDA_VISIBLE_DEVICES to achieve one GPU per process (#9170)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Cheng Wan <cwan@x.ai>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-10-17 17:30:06 -07:00
Zilin Zhu
69fe3c9726 Manually flip deepep_mode for cuda_graph (#11666) 2025-10-18 08:05:48 +08:00
fzyzcjy
8af8491298 Support casting bf16 NextN moe to fp8 (#11613) 2025-10-18 08:02:15 +08:00
fzyzcjy
505329cab0 Support shared experts overlap in cutlass moe (#11611) 2025-10-18 07:59:40 +08:00
fzyzcjy
8a382fd399 Super tiny fix missing input throughput (#11607) 2025-10-18 07:58:48 +08:00
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
Antonin Vidon
2614adf9ca [Fix] Skip visual layers when applying LoRA to Qwen2VL modules (#11519) 2025-10-17 17:39:57 -05:00
Lianmin Zheng
fdd7c69d65 [Auto Sync] Update common.py (20251017) (#11782)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-10-17 15:03:42 -07:00
Lianmin Zheng
b9a54e0968 [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests (#11777)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
2025-10-17 14:25:22 -07:00
Baizhou Zhang
20b8d2306c Cleaning indexer for DeepSeek V3.2 (#11682) 2025-10-17 13:47:21 -07:00
Yineng Zhang
b79f75fd53 [Auto Sync] Update scheduler.py (20251017) (#11738) 2025-10-17 12:36:07 -07:00
Chunyuan WU
8fcc69e7c4 Turn on shm_allreduce and shm_allgather for fp16 (#10725) 2025-10-17 12:35:20 -07:00
ykcombat
f440baa136 [Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540) 2025-10-18 02:35:06 +08:00
Yineng Zhang
da681f35d3 Revert "Set csgmv as default lora backend. (#11488)" (#11735) 2025-10-17 12:01:36 -05:00
pdasgup
9b0f725b1d add tuned fuse moe kernel for qwen3 235b fp8 on h200 (#11730)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-17 09:55:09 -07:00
Liangsheng Yin
cde5a6e30f Abstraction for spec worker and code cleanup (#11643) 2025-10-17 23:31:36 +08:00
Mick
3e4c7da2f5 ci: reduce and refactor vlm ut and combine test files (#11062) 2025-10-17 15:24:50 +00:00
Liangsheng Yin
d88ac9bc9a [overlap-spec] Make plan stream an option (#11724) 2025-10-17 15:48:57 +08:00
Liangsheng Yin
ce11dd82dc [CI] Try fix broken event loop init (#11746) 2025-10-17 13:30:17 +08:00
StonyPort
fd389df96e Reduce the image processing latency in VLM (#11541)
Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com>
2025-10-16 15:00:03 -07:00
Baizhou Zhang
b0d1d717e1 Revert "make radix cache deterministic" (#11728) 2025-10-16 14:36:15 -07:00
Simo Lin
4f24ab1718 [router][grpc] add dissag info to warm up in grpc server (#11727) 2025-10-16 14:19:55 -07:00
Mick
86b04d25b3 model: qwen3-omni (thinker-only) (#10911)
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-10-16 13:20:38 -07:00
sglang-bot
85ebeecf06 chore: bump SGLang version to 0.5.3.post3 (#11693)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2025-10-16 13:14:55 -07:00
Hank Han
0dd6cf16ba [ci]use H20 to run disaggregation test (#11543) 2025-10-16 11:42:42 -07:00
Even Zhou
3cceaa381a [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support (#11510) 2025-10-16 15:14:09 +08:00
Lifu Huang
b0d20cdec7 Set csgmv as default lora backend. (#11488) 2025-10-15 23:53:24 -05:00
YanbingJiang
cbac499750 Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2025-10-15 19:22:32 -07:00
Shangming Cai
476c67d7fc Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-10-15 19:13:08 -07:00
Shangming Cai
868403f642 [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: ZeldaHuang <hzm414167@alibaba-inc.com>
2025-10-15 18:59:14 -07:00
Hanming Lu
97d857c096 [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679) 2025-10-16 09:56:43 +08:00
Lianmin Zheng
cd7e1bd591 Sync code and test CI; rename some env vars (#11686) 2025-10-15 18:37:03 -07:00
Huaiyu, Zheng
729b7edf72 enable rmsnorm on XPU (#10248) 2025-10-15 17:54:18 -07:00
DiweiSun
4c03dbaaef [CI][XPU]enable sglang CI on Intel XPU (#9493)
Co-authored-by: huaiyuzh <huaiyu.zheng@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-15 17:13:19 -07:00
sglang-bot
baf277a9bf chore: bump SGLang version to 0.5.3.post2 (#11680)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2025-10-15 16:49:14 -07:00
Chang Su
f226d3da2a Fix missing json imports in serving_responses.py (#11681) 2025-10-15 13:01:55 -07:00
Chang Su
30ea4c462b [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-15 09:51:51 -07:00
Shangming Cai
6d0364681c Fix 1-step draft model forward (#11653)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2025-10-15 19:11:33 +08:00
Liangsheng Yin
8221f9ae8b Tiny cleanup some eagle unused codes (#11660) 2025-10-15 17:24:08 +08:00
Stefan He
6b143d62a2 Clean up some Qwen3-Next and deterministic code (#11585) 2025-10-15 15:19:37 +08:00
Zheng Wengang
b2c8566920 [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl (#11458) 2025-10-14 22:16:49 -07:00
Yineng Zhang
91fc5bb5a9 feat: add add_chunked_prefix_cache_attention_backend (#11636) 2025-10-14 21:48:13 -07:00