Commit Graph

54 Commits

Author SHA1 Message Date
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
fzyzcjy
065ce81574 Tiny cleanup fp4 gemm calls (#11537) 2025-10-13 14:48:22 -07:00
Zhiyu
155cbb51f0 Enable native ModelOpt quantization support (1/3) (#7149) 2025-10-06 13:24:15 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
fzyzcjy
5e786cca3a Support single batch overlap (#10422) 2025-10-02 18:04:36 +08:00
fzyzcjy
0b9dfba787 Support dispatch low latency (#10263) 2025-10-02 18:02:19 +08:00
    Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>
Yineng Zhang
172bcf0152 Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935) 2025-09-25 20:14:15 -07:00
Mohammad Miadh Angkad
cd4da1f19b Refactor kv_cache_scheme handling for quantization (#10132) 2025-09-25 10:32:15 -07:00
Yineng Zhang
f1d7892318 [Auto Sync] Update modelopt_quant.py (20250920) (#10688) 2025-09-20 02:37:49 -07:00
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
fzyzcjy
ac964d2e58 Support global scale in addition to per expert scale for cutedsl moe (#10270) 2025-09-14 01:17:00 -07:00
fzyzcjy
2df532ef20 Fix the global scale fix does not support EPLB and improve enabling condition (#10369) 2025-09-14 01:07:47 -07:00
fzyzcjy
efedbe6ca9 Fix global input scale incompatible with CuTe DSL moe (#10370) 2025-09-12 03:22:49 -07:00
Trevor Morris
c7e85f5378 fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296) 2025-09-11 20:19:17 -07:00
Shu Wang
3df05f4d6a [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) 2025-09-11 20:18:43 -07:00
Pavani Majety
21176b0093 [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940) 2025-09-10 12:00:23 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Yineng Zhang
45b3a6a256 Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) 2025-09-08 11:28:15 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00
jingyu-ml
bcbeed714f Qwen FP8/NVFP4 ModelOPT Quantization support (#7912) 2025-09-02 20:56:03 -07:00
    Co-authored-by: Jingyu Xin <jingyux@nvidia.com>
Pavani Majety
fcd72bd100 [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712) 2025-08-29 17:13:52 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
fzyzcjy
9dcdf5da03 Tiny fix wrong comments (#9589) 2025-08-25 03:08:10 -07:00
fzyzcjy
433266c125 Reintroduce memory usage fix (#9535) 2025-08-25 00:02:31 -07:00
Trevor Morris
a91e90d9a3 [2/2] Fuse routed scaling factor into select_experts (#8690) 2025-08-20 15:10:16 -07:00
Zhiyu
2256d62d36 Modelopt quant config adaptation (#8829) 2025-08-18 11:27:30 -07:00
fzyzcjy
b498cd21d7 Tiny make fp4 moe method parameters more static (#8520) 2025-08-17 13:26:02 -07:00
Trevor Morris
eff4eb3fdd Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667) 2025-08-15 22:08:11 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) 2025-08-14 21:14:53 -07:00
Alex Yang
1bc183c6de Faster weight processing (trtllm-gen moe nvfp4) (#9162) 2025-08-13 21:09:34 -07:00
Jianwei Dong
83262dcb29 Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766) 2025-08-11 23:44:40 -07:00
Lianmin Zheng
9a44b643c6 Fix CI (#9012) 2025-08-09 13:33:42 -07:00
Shu Wang
288ae41f7a [NVIDIA] Fix num_experts in modelopt_quant (#8811) 2025-08-06 14:35:07 -07:00
Yineng Zhang
5e91fed1c5 Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" (#8797) 2025-08-04 23:30:43 -07:00
Shu Wang
b01eeb80f8 [NVIDIA]Fix local_num_experts for EP (#8779) 2025-08-04 22:01:14 -07:00
Trevor Morris
9bd4872a34 [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' (#8768) 2025-08-04 11:08:08 -07:00
azhurkevich
915140fd18 [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552) 2025-08-04 03:10:02 -07:00
    Co-authored-by: Cheng Wan <cwan@x.ai>
fzyzcjy
7df2c0c2db Reduce memory usage for fp4 moe (#8413) 2025-07-28 22:51:23 -07:00
Elfie Guo
5c9c275bc8 Use FlashInfer FP4 gemm. (#8241) 2025-07-27 01:05:22 -07:00
Kaixi Hou
85486b6f6f [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036) 2025-07-27 00:34:41 -07:00
Trevor Morris
58c468f404 Fix FP4 MoE accuracy from missing routed_scaling_factor (#8333) 2025-07-25 16:40:23 -07:00
Cheng Wan
15ad6c9086 [1/N] MoE Refactor: refactor select_experts (#7966) 2025-07-19 00:51:15 -07:00
Cheng Wan
49b8777460 Refactor: move all quantization-related code to srt/layer/quantization (#7989) 2025-07-17 00:47:07 -07:00
Zhiyu
659907e32b Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (#7129) 2025-07-08 00:19:50 -07:00
Trevor Morris
5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) 2025-06-22 13:38:47 -07:00
    Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
    Co-authored-by: alcanderian <alcanderian@gmail.com>
Pavani Majety
c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) 2025-06-07 17:24:35 -07:00
    Signed-off-by: Pavani Majety <pmajety@nvidia.com>
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
Trevor Morris
11d760d56a FP4 weight loading and inference (2/2) (#3972) 2025-04-08 17:26:21 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686) 2025-04-08 09:11:35 -07:00
    Co-authored-by: qingquansong <ustcsqq@gmail.com>
Xiaoyu Zhang
9b81f9bd34 sglang quant module remove vllm dependency (#4507) 2025-03-17 15:51:59 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086) 2025-03-12 22:26:29 -07:00
    Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00