sglang

Author	SHA1	Message	Date
shaharmor98	8f2cd177af	add code pp support for nixl (#11375 ) Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-10-09 19:24:32 +08:00
Chang Su	ab926dd697	[router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373 )	2025-10-09 06:53:23 -04:00
Trevor Morris	a4b424c632	[DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309 )	2025-10-08 23:59:46 -07:00
Chang Su	a0557642ea	[router][lint] Add unused_qualifications to cargo lint warnings (#11366 )	2025-10-08 22:17:11 -07:00
Keyang Ru	84768d1017	[router] Refactor OpenAI router: split monolithic file and move location (#11359 )	2025-10-09 00:46:39 -04:00
Simo Lin	368fd20622	[router][grpc] disable health check generation and increase timeout (#11353 )	2025-10-08 19:23:08 -07:00
Sundara Raman Ramachandran	53bd00d975	[Generative Score API] Multi-Item scoring with custom attention mask. (#10979 )	2025-10-08 18:47:32 -07:00
Yineng Zhang	e22b13c569	[Auto Sync] Update scheduler.py (20251009) (#11350 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Junxiong Wang <junxiong@together.ai>	2025-10-08 17:39:04 -07:00
Mick	a3c2ea4451	fix: fix revision for sgl-flash-attn in sgl-kernel (#11327 )	2025-10-08 15:50:44 -07:00
Chang Su	fccac7d126	[router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342 )	2025-10-08 15:38:37 -07:00
Keyang Ru	7ac6b900f4	[router] Support history management using conversation (#11339 )	2025-10-08 15:24:02 -07:00
Chang Su	a1080b72a0	[router] Fix all unused_qualifications (#11341 )	2025-10-08 13:55:27 -07:00
Chang Su	a65ca73911	[router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340 )	2025-10-08 13:26:19 -07:00
Simo Lin	677aa0e25f	[router] improve reasoning parser lock and reduce req cloning (#11336 )	2025-10-08 11:18:15 -07:00
Simo Lin	01c9ee1ab4	[router] refactor generate to use new pipeline arch (#11323 )	2025-10-08 09:38:50 -07:00
Netanel Haber	d6837aea4d	model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909 ) Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-10-09 00:37:38 +08:00
Liangsheng Yin	c882b5ae75	[CI] improve disaggregation CI. (#11264 ) Signed-off-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-08 21:40:56 +08:00
Kevin Xiang Li	e3bb7f5ae6	benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812 ) Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2025-10-08 01:31:36 -07:00
Lifu Huang	92473e2e34	Support LoRA in bench_serving oai interface (#11318 )	2025-10-08 01:28:58 -07:00
JinYan Su	6c0bb32711	fix(decode): adjust ServerArgs import to explicit module path (#11007 )	2025-10-08 01:27:50 -07:00
Shangming Cai	0a7c4bded7	[Doc] Update mooncake nvlink transport doc for PD disaggregation (#11321 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2025-10-08 00:59:29 -07:00
Lifu Huang	edefab0c64	[2/2] Support MHA prefill with FlashAttention 4. (#10937 ) Co-authored-by: Hieu Pham <hyhieu@gmail.com>	2025-10-08 00:54:20 -07:00
Cheng Wan	97cd38e58d	Skip weight loading in deepgemm compilation (#11312 )	2025-10-07 21:52:46 -07:00
Cheng Wan	3c06b673af	[8/N] MoE Refactor: deprecate `EPMoE` (#11211 )	2025-10-07 21:51:41 -07:00
Adarsh Shirawalmath	7c3f07dbcb	[Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (#9545 )	2025-10-08 12:38:48 +08:00
Chang Su	edd86b8853	[router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314 ) Co-authored-by: Simo Lin <linsimo.mark@gmail.com>	2025-10-07 20:50:20 -07:00
Liangsheng Yin	4b4dc132fa	Rename `ngram_utils` -> `ngram_info` (#11316 )	2025-10-08 11:49:46 +08:00
YAMY	5a9170d993	Optimize copy_kv_cache for spec decoding (#11126 ) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2025-10-08 10:43:30 +08:00
Xinyuan Tong	c4d77774e1	update sampling_params documentation with defaults (#11315 )	2025-10-07 18:36:26 -07:00
DarkSharpness	832c84fba9	[Chore] Update xgrammar 0.1.24 -> 0.1.25 (#10710 )	2025-10-07 18:22:28 -07:00
Mick	64d1505c0a	ci: unify the model launch method of nightly ci (#11230 )	2025-10-07 18:13:14 -07:00
cctry	f3764c26a3	Clean match_prefix and prepare_for_extend for mem cache V2 (#11200 )	2025-10-07 17:54:18 -07:00
Chang Su	7ba3de0e92	[oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (#11304 )	2025-10-08 00:36:05 +00:00
Simo Lin	fde9b96392	[router] cleanup worker health check to return early (#11310 )	2025-10-07 16:53:10 -07:00
Chang Su	f094e0a490	[router][grpc] Fix request_id extraction when n > 1 (#11311 )	2025-10-07 19:27:56 -04:00
Keyang Ru	4ed67c27e3	[router] support Openai router conversation API CRUD (#11297 )	2025-10-07 15:31:35 -07:00
Bowen Bao	cd4b39a900	[quantization] Properly ignore quantization for layers excluded in quant_config (#11205 )	2025-10-07 14:06:05 -07:00
Chang Su	420c99acfe	[router][grpc] Fix error message format in grpc chat handler (#11307 )	2025-10-07 13:54:02 -07:00
Xinyuan Tong	e3c7f09146	Update tool parser and related documentation (#11223 )	2025-10-07 11:03:40 -07:00
Chang Su	6f1e03a456	[router][grpc] Fix sampling_params.stop_strs is None (#11306 )	2025-10-07 10:57:38 -07:00
Simo Lin	f4affd4df5	[router] fix grpc connection conversion and add optimization (#11305 )	2025-10-07 10:39:33 -07:00
hzh0425	df08bf9b9f	[Doc]: Best Practice for HICache (#11001 ) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-10-08 00:59:21 +08:00
ykwd	69efdd27bc	[Doc] HiCache Design Documents (#11027 ) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-10-08 00:35:45 +08:00
Chang Su	64582caa84	[router][grpc] Refactor chat template content format detection (#11288 )	2025-10-07 08:38:51 -07:00
Simo Lin	2fcd56eaf6	[router] add get server info and get model info in grpc server (#11303 )	2025-10-07 08:36:52 -07:00
Wenyi Xu	0958a39704	[Docs] [Router] Update Observability and Common Issues Section (#11302 )	2025-10-07 08:03:09 -07:00
Yuan Luo	4f42c8cd3e	[sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-07 14:31:11 +00:00
Liangsheng Yin	3ddd7dc9f8	Introduce future indices (#11301 )	2025-10-07 22:24:02 +08:00
Liangsheng Yin	501dfa6b42	Remove sampling info events and overlap thread file (#11300 )	2025-10-07 21:34:25 +08:00
Simo Lin	79d3495177	[router] add reasoning and tool parser argument in router (#11290 )	2025-10-07 09:08:32 -04:00

5791 Commits