Commit Graph

450 Commits

Author SHA1 Message Date
pranavm-nvidia
64574ef8c0 Enables speculative decoding for the trtllm_mla attention backend (#9238) 2025-08-21 01:18:21 -07:00
Nicolas Castet
c10b8e6a0f Support DP attention with GPT-OSS (#9359) 2025-08-20 16:36:31 -07:00
Lifu Huang
b0980af89f Support pinning adapter via server args. (#9249) 2025-08-20 16:25:01 -07:00
fzyzcjy
42c8704560 Add PDL support for quant kernel and rope kernel (#9106) 2025-08-20 01:56:29 -07:00
Lianmin Zheng
1ec9769753 [Docs] Update contribution guide (#9383) 2025-08-19 23:37:45 -07:00
Keyang Ru
886454e8e7 [MISC] use dynamic choices for tool-call-parser argument (#9316) 2025-08-18 13:02:10 -07:00
kk
1c1f8a118e Combine fp4.py and mxfp4.py into one file and support dynamic mxfp4 quantization in mxfp4.py (#9049)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-08-16 19:01:54 -07:00
Trevor Morris
eff4eb3fdd Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667) 2025-08-15 22:08:11 -07:00
Cheng Wan
e3e75a786a Fix the deprecation warning for enable_flashinfer_mxfp4_moe (#9214) 2025-08-14 23:59:35 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) 2025-08-14 21:14:53 -07:00
Yineng Zhang
27985c27aa feat: update model config (#9202) 2025-08-14 15:15:27 -07:00
Chengxing Xie
c1c7dc4534 feat: Add model version tracking with API endpoints and response metadata (#8795) 2025-08-14 12:13:46 -07:00
Lianmin Zheng
9e426466af Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-13 13:56:04 -07:00
Ke Bao
0ff6d1fce1 Support FA3 backend for gpt-oss (#9028) 2025-08-13 10:41:50 -07:00
huangtingwei
0edda32001 Support page first layout zero copy for mooncake store (#8651)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-08-12 15:59:26 -07:00
jacky.cheng
25caa7a8a9 [AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Harsh Menon <harsh@nod-labs.com>
Co-authored-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
Co-authored-by: Stanley Winata <stanley@nod-labs.com>
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
Co-authored-by: Ivan Butygin <ibutygin@amd.com>
2025-08-12 13:49:11 -07:00
Chang Su
a218490136 (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support (#9043) 2025-08-11 18:59:18 -07:00
Faraz
f508cd3cb7 TRTLLM-MLA FP8 path (#8638)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-08-11 14:02:13 -07:00
Xiaoyu Zhang
44e86480e8 fuse allreduce and residual_rmsnorm (#8731) 2025-08-11 13:50:53 -07:00
633WHU
3bffe11279 Fix chunked prefill size validation for disabled state (#8973)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-11 11:05:29 -07:00
Baizhou Zhang
75e6a7cde1 Support radix cache for Lora feature (#7216) 2025-08-11 10:14:11 -07:00
Lianmin Zheng
2449a0afe2 Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00
Lianmin Zheng
706bd69cc5 Clean up server_args.py to have a dedicated function for model specific adjustments (#8983) 2025-08-08 19:56:50 -07:00
Lianmin Zheng
a947154286 Revert "Support Multi Process Tokenizer Manager" (#8960) 2025-08-08 02:28:27 -07:00
pansicheng
e2fd2b9c7e Simple prefetch policy (#8692) 2025-08-08 02:09:28 -07:00
ybyang
7490e3f67d Support Multi Process Tokenizer Manager (#6555)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: lw9527 <952799980@qq.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
2025-08-08 01:45:50 -07:00
Cheng Wan
1d24db8348 Expert Parallelism for GPT-OSS (#8944) 2025-08-08 00:46:42 -07:00
Xiaoyu Zhang
0d1e27a0c5 Better optimization log for gpt-oss model (#8953) 2025-08-08 00:11:48 -07:00
fzyzcjy
774b47f3f1 Reduce scheduler recv requests overhead (#8947) 2025-08-08 00:10:05 -07:00
Xiaoyu Zhang
76915d68a8 Fix enable flashinfer mxfp4 moe bf16 check (#8950) 2025-08-07 22:52:09 -07:00
Xiaoyu Zhang
47824c1488 [Perf] Auto enable best flashinfer mxfp4 kernel in b200 (#8898) 2025-08-07 01:08:41 -07:00
Xinyuan Tong
c36a6693f3 Disable gemma3 for SWA due to CUDA illegal memory access error (#8895)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-07 00:44:44 -07:00
PGFLMG
b7cd743038 [Feat] QWen-1M context support[2/2]: Update block sparse attention backend (#5949) 2025-08-06 23:49:36 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697) 2025-08-06 19:39:45 -07:00
eigen
6ad6c8c9e6 feat: openai oss attention sink support with trtllm-gen backend #8825 (#8834)
Co-authored-by: averyhuang <averyh@nvidia.com>
2025-08-06 19:18:27 -07:00
Xiaoyu Zhang
4373df5525 add flashinfer mxfp4 (#8847) 2025-08-06 16:23:41 -07:00
Chang Su
92cc32d9fc Support v1/responses and use harmony in serving_chat (#8837)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-06 16:20:34 -07:00
Ying Sheng
168033d5fb Support mxfp4 for GPT-OSS (#8843)
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
Co-authored-by: liz-badada <jinyanc@nvidia.com>
Co-authored-by: xutizhou <xutingz@nvidia.com>
Co-authored-by: linhu-nv <linhu@nvidia.com>
2025-08-06 00:05:25 -07:00
HouseWest
ca47e24f5d [Feature] improve TBO: two chunk overlap (#8144) 2025-08-05 21:11:01 -07:00
Ke Bao
8128e08d36 Turn off hybrid cache by default (#8839) 2025-08-06 09:53:45 +08:00
Ying Sheng
c1d2061f97 Add initial support for gpt-oss (#8824) 2025-08-05 13:42:01 -07:00
eigen
40e3b2beeb feat: add trtllm-gen mha from direct call (#8782)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-08-05 03:28:39 -07:00
kk
d4bf5a8524 Support OCP MXFP4 quantization on AMD GPUs (#8255)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
2025-08-04 18:14:52 -07:00
Lifu Huang
7cb20754fa [Fix] Fix several issues preventing gemma3n LoRA support. (#8776) 2025-08-04 17:11:46 -07:00
azhurkevich
915140fd18 [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552)
Co-authored-by: Cheng Wan <cwan@x.ai>
2025-08-04 03:10:02 -07:00
Yingchun Lai
ed6f7597b3 Fix the missing 'lof' choice of --schedule-policy server args (#7114) 2025-08-03 12:29:42 -07:00
Guanhua Wang
f7b2853ff8 [feat] support minimum token load balance in dp attention (#7379) 2025-08-03 00:46:47 -07:00
Lifu Huang
8675bdf246 Support limiting max loaded loras in CPU. (#8650) 2025-08-03 00:02:23 -07:00
Lianmin Zheng
e314b084c5 [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (#8693) 2025-08-02 18:43:14 -07:00
Nicolas Castet
82e6c3a65a Add support for NCCL symmetric memory for TP allreduces (#8238) 2025-08-01 23:30:55 +00:00
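A listing like the one above is ordinary `git log` output rendered by the forge UI. As a minimal sketch of how to reproduce the same author / SHA / subject / date layout (assuming `git` is installed; this uses a throwaway temp repo with one example commit taken from the list, not the real repository):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create one empty commit mirroring the top entry of the listing.
git -c user.name="pranavm-nvidia" -c user.email="dev@example.com" \
    commit --allow-empty -q \
    -m "Enables speculative decoding for the trtllm_mla attention backend (#9238)"

# %an = author name, %h = abbreviated SHA, %s = subject, %ad = author date.
# core.abbrev=10 matches the 10-character SHAs shown above.
git -c core.abbrev=10 log \
    --pretty=format:'%an%n%h %s %ad' \
    --date=format:'%Y-%m-%d %H:%M:%S'
```

The trailer lines in the listing (`Signed-off-by:`, `Co-authored-by:`) live in each commit's message body and would appear with `%b` or `git log --format=full`.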