Commit Graph

334 Commits

9b8ebb2798  move more files under srt/utils (#11285)
            Lianmin Zheng, 2025-10-09 16:46:15 -07:00
3c06b673af  [8/N] MoE Refactor: deprecate EPMoE (#11211)
            Cheng Wan, 2025-10-07 21:51:41 -07:00
cd4b39a900  [quantization] Properly ignore quantization for layers excluded in quant_config (#11205)
            Bowen Bao, 2025-10-07 14:06:05 -07:00
155cbb51f0  Enable native ModelOpt quantization support (1/3) (#7149)
            Zhiyu, 2025-10-06 13:24:15 -07:00
            Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
baee08601b  [quantization] Enable aiter mxfp4 fused_moe for Quark (#10048)
            Bowen Bao, 2025-10-05 19:51:34 -07:00
            Co-authored-by: HaiShaw <hixiao@gmail.com>
5e786cca3a  Support single batch overlap (#10422)
            fzyzcjy, 2025-10-02 18:04:36 +08:00
0b9dfba787  Support dispatch low latency (#10263)
            fzyzcjy, 2025-10-02 18:02:19 +08:00
            Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>
8831c55c3d  [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (#9642)
            DevashishLal-CB, 2025-09-30 11:26:17 +08:00
8ebf72fef3  [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (#10981)
            kk, 2025-09-26 22:12:22 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
8260574729  fix: fp8 quantization failure of qwen 2.5 VL 7B model (#10112)
            Yueyang Pan, 2025-09-27 05:05:23 +00:00
            Signed-off-by: PanJason <pyyjason@gmail.com>
172bcf0152  Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935)
            Yineng Zhang, 2025-09-25 20:14:15 -07:00
cd4da1f19b  Refactor kv_cache_scheme handling for quantization (#10132)
            Mohammad Miadh Angkad, 2025-09-25 10:32:15 -07:00
c4e314f986  Restruct sgl-kernel benchmark (#10861)
            Xiaoyu Zhang, 2025-09-25 07:45:25 +08:00
f47a2c67e6  [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (#10825)
            Lianmin Zheng, 2025-09-23 16:48:12 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
113f8f65a2  [Auto Sync] Update configurer.py (20250923) (#10765)
            Lianmin Zheng, 2025-09-22 17:34:34 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
e22f3a5ec9  [Ascend] optimize Qwen3 on Ascend (#10574)
            ronnie_zheng, 2025-09-22 17:18:36 -07:00
            Co-authored-by: c30031083 <chenxu140@huawei.com>
f1d7892318  [Auto Sync] Update modelopt_quant.py (20250920) (#10688)
            Yineng Zhang, 2025-09-20 02:37:49 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
388c05d544  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (#10579)
            yhyang201, 2025-09-18 11:44:43 -07:00
c32fb7a24d  [ROCm] Fix fp8 quantization accuracy issue. (#10558)
            sogalin, 2025-09-17 17:44:59 -07:00
b2435be682  Cache the result of is_blackwell platform check (#10498)
            b8zhong, 2025-09-15 22:30:28 -07:00
a220c14f81  fix crash of DeepSeek-V3 update_weights_from_disk (#8863)
            scut-cbq, 2025-09-16 09:45:15 +08:00
            Co-authored-by: parkeychen <parkeychen@tencent.com>
2f8ba6fe82  [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path (#10429)
            Cheng Wan, 2025-09-14 17:34:28 -07:00
ac964d2e58  Support global scale in addition to per expert scale for cutedsl moe (#10270)
            fzyzcjy, 2025-09-14 01:17:00 -07:00
fa46e2bd40  Support offloading in fp8 (#9948)
            fzyzcjy, 2025-09-14 01:14:28 -07:00
2df532ef20  Fix the global scale fix does not support EPLB and improve enabling condition (#10369)
            fzyzcjy, 2025-09-14 01:07:47 -07:00
efedbe6ca9  Fix global input scale incompatible with CuTe DSL moe (#10370)
            fzyzcjy, 2025-09-12 03:22:49 -07:00
c7e85f5378  fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296)
            Trevor Morris, 2025-09-11 20:19:17 -07:00
3df05f4d6a  [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199)
            Shu Wang, 2025-09-11 20:18:43 -07:00
6d55f60e77  Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292)
            Yineng Zhang, 2025-09-10 18:24:23 -07:00
21176b0093  [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940)
            Pavani Majety, 2025-09-10 12:00:23 -07:00
            Signed-off-by: Pavani Majety <pmajety@nvidia.com>
            Co-authored-by: Yineng Zhang <me@zhyncs.com>
            Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
45b3a6a256  Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176)
            Yineng Zhang, 2025-09-08 11:28:15 -07:00
b7d1f17b8d  Revert "enable auto-round quantization model (#6226)" (#10148)
            Yineng Zhang, 2025-09-07 22:31:11 -07:00
c8295d2353  enable auto-round quantization model (#6226)
            Weiwei, 2025-09-07 22:05:35 -07:00
            Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
b67c277f86  [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013)
            Even Zhou, 2025-09-07 21:50:49 -07:00
7577f0e40f  Add graph runner support with torch compile on CPU (#7843)
            Cao E, 2025-09-07 21:33:58 -07:00
7802586cab  fix the fp8 topk_config.correction_bias is none bug (#10040)
            Rain Jiang, 2025-09-07 20:28:14 -07:00
5a7e10fe4c  [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144)
            Cheng Wan, 2025-09-07 19:43:59 -07:00
a5a03209e9  Fix circular import (#10107)
            Cheng Wan, 2025-09-06 01:34:17 -07:00
9a719b7afc  [NVIDIA] Remove unused get_fused_moe_impl_class function (#9764)
            Kaixi Hou, 2025-09-05 22:41:22 -07:00
3fa62da78c  [7/N] MoE Refactor: the implementation of new framework (#9269)
            Cheng Wan, 2025-09-05 21:09:09 -07:00
4efe844a25  enable aiter gemm_a8w8_bpreshuffle for ptpc gemm (#8555)
            Morpheus Guo, 2025-09-05 12:54:40 -07:00
5e5c30d9ab  Tiny let DeepGEMM scale checks cover more cases (#7182)
            fzyzcjy, 2025-09-05 19:52:32 +08:00
            Co-authored-by: Yineng Zhang <me@zhyncs.com>
339f8eef09  [1/2] Optimizations and refactors about quant kernel (#9534)
            fzyzcjy, 2025-09-05 18:45:08 +08:00
e96973742c  Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008)
            kk, 2025-09-04 15:11:22 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
            Co-authored-by: HAI <hixiao@gmail.com>
            Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
2c562fd2d0  Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993)
            Hubert Lu, 2025-09-04 00:48:58 -07:00
1b2ff4fb7f  Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959)
            Yineng Zhang, 2025-09-03 00:50:04 -07:00
2c7ca33abb  Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" (#9955)
            Yineng Zhang, 2025-09-02 23:49:56 -07:00
0dfd54d11d  Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)
            kk, 2025-09-02 22:26:28 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
            Co-authored-by: wghuang <wghuang@amd.com>
bcbeed714f  Qwen FP8/NVFP4 ModelOPT Quantization support (#7912)
            jingyu-ml, 2025-09-02 20:56:03 -07:00
            Co-authored-by: Jingyu Xin <jingyux@nvidia.com>
6243c36702  [Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)
            Al-Ekram Elahee Hridoy, 2025-09-02 19:31:15 -07:00