Commit Graph

54 Commits

Author SHA1 Message Date
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
fzyzcjy
065ce81574 Tiny cleanup fp4 gemm calls (#11537) 2025-10-13 14:48:22 -07:00
Zhiyu
155cbb51f0 Enable native ModelOpt quantization support (1/3) (#7149) 2025-10-06 13:24:15 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
fzyzcjy
5e786cca3a Support single batch overlap (#10422) 2025-10-02 18:04:36 +08:00
fzyzcjy
0b9dfba787 Support dispatch low latency (#10263) 2025-10-02 18:02:19 +08:00
    Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>
Yineng Zhang
172bcf0152 Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935) 2025-09-25 20:14:15 -07:00
Mohammad Miadh Angkad
cd4da1f19b Refactor kv_cache_scheme handling for quantization (#10132) 2025-09-25 10:32:15 -07:00
Yineng Zhang
f1d7892318 [Auto Sync] Update modelopt_quant.py (20250920) (#10688) 2025-09-20 02:37:49 -07:00
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
fzyzcjy
ac964d2e58 Support global scale in addition to per expert scale for cutedsl moe (#10270) 2025-09-14 01:17:00 -07:00
fzyzcjy
2df532ef20 Fix the global scale fix does not support EPLB and improve enabling condition (#10369) 2025-09-14 01:07:47 -07:00
fzyzcjy
efedbe6ca9 Fix global input scale incompatible with CuTe DSL moe (#10370) 2025-09-12 03:22:49 -07:00
Trevor Morris
c7e85f5378 fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296) 2025-09-11 20:19:17 -07:00
Shu Wang
3df05f4d6a [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) 2025-09-11 20:18:43 -07:00
Pavani Majety
21176b0093 [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940) 2025-09-10 12:00:23 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Yineng Zhang
45b3a6a256 Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) 2025-09-08 11:28:15 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00
jingyu-ml
bcbeed714f Qwen FP8/NVFP4 ModelOPT Quantization support (#7912) 2025-09-02 20:56:03 -07:00
    Co-authored-by: Jingyu Xin <jingyux@nvidia.com>
Pavani Majety
fcd72bd100 [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712) 2025-08-29 17:13:52 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
fzyzcjy
9dcdf5da03 Tiny fix wrong comments (#9589) 2025-08-25 03:08:10 -07:00
fzyzcjy
433266c125 Reintroduce memory usage fix (#9535) 2025-08-25 00:02:31 -07:00
Trevor Morris
a91e90d9a3 [2/2] Fuse routed scaling factor into select_experts (#8690) 2025-08-20 15:10:16 -07:00
Zhiyu
2256d62d36 Modelopt quant config adaptation (#8829) 2025-08-18 11:27:30 -07:00
fzyzcjy
b498cd21d7 Tiny make fp4 moe method parameters more static (#8520) 2025-08-17 13:26:02 -07:00
Trevor Morris
eff4eb3fdd Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667) 2025-08-15 22:08:11 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) 2025-08-14 21:14:53 -07:00
Alex Yang
1bc183c6de Faster weight processing (trtllm-gen moe nvfp4) (#9162) 2025-08-13 21:09:34 -07:00
Jianwei Dong
83262dcb29 Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766) 2025-08-11 23:44:40 -07:00
Lianmin Zheng
9a44b643c6 Fix CI (#9012) 2025-08-09 13:33:42 -07:00
Shu Wang
288ae41f7a [NVIDIA] Fix num_experts in modelopt_quant (#8811) 2025-08-06 14:35:07 -07:00
Yineng Zhang
5e91fed1c5 Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" (#8797) 2025-08-04 23:30:43 -07:00
Shu Wang
b01eeb80f8 [NVIDIA]Fix local_num_experts for EP (#8779) 2025-08-04 22:01:14 -07:00
Trevor Morris
9bd4872a34 [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' (#8768) 2025-08-04 11:08:08 -07:00
azhurkevich
915140fd18 [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552) 2025-08-04 03:10:02 -07:00
    Co-authored-by: Cheng Wan <cwan@x.ai>
fzyzcjy
7df2c0c2db Reduce memory usage for fp4 moe (#8413) 2025-07-28 22:51:23 -07:00
Elfie Guo
5c9c275bc8 Use FlashInfer FP4 gemm. (#8241) 2025-07-27 01:05:22 -07:00
Kaixi Hou
85486b6f6f [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036) 2025-07-27 00:34:41 -07:00
Trevor Morris
58c468f404 Fix FP4 MoE accuracy from missing routed_scaling_factor (#8333) 2025-07-25 16:40:23 -07:00
Cheng Wan
15ad6c9086 [1/N] MoE Refactor: refactor select_experts (#7966) 2025-07-19 00:51:15 -07:00
Cheng Wan
49b8777460 Refactor: move all quantization-related code to srt/layer/quantization (#7989) 2025-07-17 00:47:07 -07:00
Zhiyu
659907e32b Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (#7129) 2025-07-08 00:19:50 -07:00
Trevor Morris
5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) 2025-06-22 13:38:47 -07:00
    Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
    Co-authored-by: alcanderian <alcanderian@gmail.com>
Pavani Majety
c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) 2025-06-07 17:24:35 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
Trevor Morris
11d760d56a FP4 weight loading and inference (2/2) (#3972) 2025-04-08 17:26:21 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686) 2025-04-08 09:11:35 -07:00
    Co-authored-by: qingquansong <ustcsqq@gmail.com>
Xiaoyu Zhang
9b81f9bd34 sglang quant module remove vllm dependency (#4507) 2025-03-17 15:51:59 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086) 2025-03-12 22:26:29 -07:00
    Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00