sglang

Author	SHA1	Message	Date
Chang Su	28b8a4064d	[router][CI] Clean up imports and prints statements in sgl-router/py_test (#12024)	2025-10-23 11:56:57 -07:00
Mick	8bd26dd4e6	ci: fix night-ci with push retry mechanism (#11765)	2025-10-23 11:31:05 -07:00
Lianmin Zheng	ab07cd3e5a	[Auto Sync] Update test_deterministic_utils.py (20251023) (#12022) Co-authored-by: Stefan He <hebiaobuaa@gmail.com>	2025-10-23 11:20:45 -07:00
Netanel Haber	a98496834b	Feature/nano v2 offline modelopt fp8 and nvfp4 (#12018) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-10-23 11:16:46 -07:00
Simo Lin	a4b637d87a	[router] change ci names and update log level in ci (#12021)	2025-10-23 10:36:19 -07:00
Teng Ma	96a5e4dd79	[Feature] Support loading weights from ckpt engine worker (#11755) Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com> Co-authored-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-23 09:23:30 -07:00
cctry	b0b4f71679	[Fix] memory leak by overlap + retract (#11981) Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>	2025-10-23 22:59:23 +08:00
Liangsheng Yin	6c18addb6f	Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" (#12015)	2025-10-23 21:27:58 +08:00
Liangsheng Yin	32852fe9e9	Move memory runtime checker to mixin class (#12014)	2025-10-23 20:53:26 +08:00
Arthur Cheng	53c2934dce	[Router] Consolidate ConnectionMode enum to core module (#11937)	2025-10-23 05:15:49 -07:00
Keyang Ru	e321c97113	[router] Add comprehensive E2E tests for Response API (#11988)	2025-10-23 05:13:51 -07:00
Netanel Haber	d6fee73d1f	Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 (#11866)	2025-10-23 17:29:02 +08:00
Qiaolin Yu	36a4cad7b0	Support overlap-spec-v2 with trtllm_mla attention backend (#11821)	2025-10-23 16:55:35 +08:00
HAI	65d376b491	aiter update to v0.1.6.post1 (#12004)	2025-10-22 23:53:05 -07:00
yinghui	c23eda8589	Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend (#11985)	2025-10-22 22:44:45 -07:00
Jue WANG	138ff23187	Allow to disable batch decoding. (#11944)	2025-10-22 21:57:12 -07:00
blzheng	13fb8b5489	[CPU] Optimize FP16 decode_attention_cpu (#10652)	2025-10-22 21:39:51 -07:00
Zhengyi Lai	81fd2b0ee0	fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 (#11965)	2025-10-22 21:20:54 -07:00
Zaili Wang	007b849b0e	[CPU] misc updates (#11906)	2025-10-22 21:10:05 -07:00
fzyzcjy	8612811d85	Bump grace blackwell DeepEP version (#11990)	2025-10-22 21:08:12 -07:00
Johnny	e7aa4664b3	[NVIDIA] Build CUDA 13 (#11299) Co-authored-by: ishandhanani <ishandhanani@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-10-22 20:03:12 -07:00
b8zhong	4d4feccbb2	[ROCm] Remove vLLM rope dependency & use AITER impl (#11322)	2025-10-22 19:17:34 -07:00
jacky.cheng	99c92ff24b	[AMD] Support a new flag to disable quant on parallelLinear layer if required (#11811)	2025-10-22 19:16:15 -07:00
Chang Su	6ade6a02d4	[grpc] Support gRPC standard health check (#11955)	2025-10-22 16:59:09 -07:00
Baizhou Zhang	983ef22cf3	[Doc] Update deterministic inference flag in server_arguments.md (#11978) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-22 14:12:15 -07:00
Christian Bahls	164302c7df	Implement BGE-M3 Sparse Embeddings in SGLang (#10869) Co-authored-by: Christian Bahls <christian.bahls@planet-ai.de> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-22 13:46:16 -07:00
Simo Lin	5dccf69713	[router] create worker removal step and clean up worker manager (#11921)	2025-10-22 13:26:06 -07:00
jiahanc	eec9e471ca	[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563) Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>	2025-10-22 13:11:16 -07:00
Lianmin Zheng	6d535b719f	Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980)	2025-10-22 11:50:26 -07:00
yuho	fdcb1d13c5	[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977)	2025-10-22 11:29:55 -07:00
Hongbo Xu	d7e834d6ba	[6/n]decouple quantization implementation from vLLM dependency (#10750)	2025-10-23 02:07:55 +08:00
Minglei Zhu	200a3c0bb1	[Documentation] add doc for deterministic inference (#11956)	2025-10-22 12:36:15 -05:00
Keyang Ru	77258ce039	[router] Support multiple worker URLs for OpenAI router (#11723)	2025-10-22 09:27:58 -07:00
Fan Yin	1d097aac87	[Fix] Remove unused import from triton_kernels_moe.py (#11967) Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>	2025-10-22 21:02:57 +08:00
Shangming Cai	7fceeef599	Fix flaky hicache test with mooncake backend (#11953) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2025-10-22 21:00:47 +08:00
996_icu	88568c01eb	[model] Support POINTSV15Chat (#9651) Co-authored-by: josephyou <josephyou@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: root <root@TENCENT64.site>	2025-10-22 16:58:17 +08:00
Hank Han	904655c5fd	[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606) Co-authored-by: Xun Sun <UNIDY2002@outlook.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-22 01:13:31 -07:00
Xun Sun	e028af6998	Fix mooncake dispatcher (#11908)	2025-10-22 01:11:49 -07:00
Zhiyu	80b2b3207a	Enable native ModelOpt quantization support (3/3) (#10154) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>	2025-10-21 21:44:29 -07:00
Johnny	4b65ed42cc	[NVIDIA] upstream FA4 and fix cccl path (#11929)	2025-10-21 21:18:25 -07:00
Fan Yin	23afdfd1c2	[sgl-kernel] support flashmla libtorch (#11717)	2025-10-21 21:17:50 -07:00
Liangsheng Yin	9d61205dac	[lint] improve ruff check (#11922) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2025-10-22 11:32:50 +08:00
Chang Su	590bc4b7a7	[router][grpc] Fix background tasks stored with wrong id (#11945)	2025-10-21 18:38:51 -07:00
Keyang Ru	63cfe1b032	[router] Add gRPC E2E test suite (#11790)	2025-10-21 17:51:21 -07:00
Chang Su	70f6309cd4	[router][grpc] Support `v1/responses` API (#11926)	2025-10-21 17:41:48 -07:00
Yineng Zhang	704160017d	fix: resolve flashinfer 0.4.1 import (#11940)	2025-10-21 17:19:57 -07:00
Keyang Ru	87a92e459a	Fix openai input_text type compatibility (#11935)	2025-10-21 16:10:35 -07:00
Yineng Zhang	c461e7714d	[Auto Sync] Update forward_batch_info.py (20251021) (#11934) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: yinghui <32845984+cicirori@users.noreply.github.com>	2025-10-21 15:52:15 -07:00
Zheng Wengang	fde2decf8b	[BugFix][Qwen3-VL]: add metadata for video in qwen3-vl (#11377)	2025-10-21 15:36:01 -07:00
Yineng Zhang	9792b9d7e3	chore: upgrade flashinfer 0.4.1 (#11933)	2025-10-21 14:46:31 -07:00

6167 Commits