| Author | Commit | Subject | Date |
| --- | --- | --- | --- |
| Chang Su | 627974405d | [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) | 2025-10-17 16:49:46 -07:00 |
| fzyzcjy | 065ce81574 | Tiny cleanup fp4 gemm calls (#11537) | 2025-10-13 14:48:22 -07:00 |
| Zhiyu | 155cbb51f0 | Enable native ModelOpt quantization support (1/3) (#7149) — Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> | 2025-10-06 13:24:15 -07:00 |
| fzyzcjy | 5e786cca3a | Support single batch overlap (#10422) | 2025-10-02 18:04:36 +08:00 |
| fzyzcjy | 0b9dfba787 | Support dispatch low latency (#10263) — Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com> | 2025-10-02 18:02:19 +08:00 |
| Yineng Zhang | 172bcf0152 | Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935) | 2025-09-25 20:14:15 -07:00 |
| Mohammad Miadh Angkad | cd4da1f19b | Refactor kv_cache_scheme handling for quantization (#10132) | 2025-09-25 10:32:15 -07:00 |
| Yineng Zhang | f1d7892318 | [Auto Sync] Update modelopt_quant.py (20250920) (#10688) — Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> | 2025-09-20 02:37:49 -07:00 |
| fzyzcjy | ac964d2e58 | Support global scale in addition to per expert scale for cutedsl moe (#10270) | 2025-09-14 01:17:00 -07:00 |
| fzyzcjy | 2df532ef20 | Fix the global scale fix does not support EPLB and improve enabling condition (#10369) | 2025-09-14 01:07:47 -07:00 |
| fzyzcjy | efedbe6ca9 | Fix global input scale incompatible with CuTe DSL moe (#10370) | 2025-09-12 03:22:49 -07:00 |
| Trevor Morris | c7e85f5378 | fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296) | 2025-09-11 20:19:17 -07:00 |
| Shu Wang | 3df05f4d6a | [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) | 2025-09-11 20:18:43 -07:00 |
| Pavani Majety | 21176b0093 | [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940) — Signed-off-by: Pavani Majety <pmajety@nvidia.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>; Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> | 2025-09-10 12:00:23 -07:00 |
| Yineng Zhang | 45b3a6a256 | Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) | 2025-09-08 11:28:15 -07:00 |
| Cheng Wan | 3fa62da78c | [7/N] MoE Refactor: the implementation of new framework (#9269) | 2025-09-05 21:09:09 -07:00 |
| jingyu-ml | bcbeed714f | Qwen FP8/NVFP4 ModelOPT Quantization support (#7912) — Co-authored-by: Jingyu Xin <jingyux@nvidia.com> | 2025-09-02 20:56:03 -07:00 |
| Pavani Majety | fcd72bd100 | [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712) — Signed-off-by: Pavani Majety <pmajety@nvidia.com> | 2025-08-29 17:13:52 -07:00 |
| fzyzcjy | 9dcdf5da03 | Tiny fix wrong comments (#9589) | 2025-08-25 03:08:10 -07:00 |
| fzyzcjy | 433266c125 | Reintroduce memory usage fix (#9535) | 2025-08-25 00:02:31 -07:00 |
| Trevor Morris | a91e90d9a3 | [2/2] Fuse routed scaling factor into select_experts (#8690) | 2025-08-20 15:10:16 -07:00 |
| Zhiyu | 2256d62d36 | Modelopt quant config adaptation (#8829) | 2025-08-18 11:27:30 -07:00 |
| fzyzcjy | b498cd21d7 | Tiny make fp4 moe method parameters more static (#8520) | 2025-08-17 13:26:02 -07:00 |
| Trevor Morris | eff4eb3fdd | Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667) | 2025-08-15 22:08:11 -07:00 |
| Cheng Wan | 295895120d | [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) | 2025-08-14 21:14:53 -07:00 |
| Alex Yang | 1bc183c6de | Faster weight processing (trtllm-gen moe nvfp4) (#9162) | 2025-08-13 21:09:34 -07:00 |
| Jianwei Dong | 83262dcb29 | Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766) | 2025-08-11 23:44:40 -07:00 |
| Lianmin Zheng | 9a44b643c6 | Fix CI (#9012) | 2025-08-09 13:33:42 -07:00 |
| Shu Wang | 288ae41f7a | [NVIDIA] Fix num_experts in modelopt_quant (#8811) | 2025-08-06 14:35:07 -07:00 |
| Yineng Zhang | 5e91fed1c5 | Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" (#8797) | 2025-08-04 23:30:43 -07:00 |
| Shu Wang | b01eeb80f8 | [NVIDIA]Fix local_num_experts for EP (#8779) | 2025-08-04 22:01:14 -07:00 |
| Trevor Morris | 9bd4872a34 | [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' (#8768) | 2025-08-04 11:08:08 -07:00 |
| azhurkevich | 915140fd18 | [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552) — Co-authored-by: Cheng Wan <cwan@x.ai> | 2025-08-04 03:10:02 -07:00 |
| fzyzcjy | 7df2c0c2db | Reduce memory usage for fp4 moe (#8413) | 2025-07-28 22:51:23 -07:00 |
| Elfie Guo | 5c9c275bc8 | Use FlashInfer FP4 gemm. (#8241) | 2025-07-27 01:05:22 -07:00 |
| Kaixi Hou | 85486b6f6f | [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036) | 2025-07-27 00:34:41 -07:00 |
| Trevor Morris | 58c468f404 | Fix FP4 MoE accuracy from missing routed_scaling_factor (#8333) | 2025-07-25 16:40:23 -07:00 |
| Cheng Wan | 15ad6c9086 | [1/N] MoE Refactor: refactor select_experts (#7966) | 2025-07-19 00:51:15 -07:00 |
| Cheng Wan | 49b8777460 | Refactor: move all quantization-related code to srt/layer/quantization (#7989) | 2025-07-17 00:47:07 -07:00 |
| Zhiyu | 659907e32b | Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (#7129) | 2025-07-08 00:19:50 -07:00 |
| Trevor Morris | 5962e70d8d | FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) — Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>; Co-authored-by: alcanderian <alcanderian@gmail.com> | 2025-06-22 13:38:47 -07:00 |
| Pavani Majety | c2c4f57f63 | [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) — Signed-off-by: Pavani Majety <pmajety@nvidia.com> | 2025-06-07 17:24:35 -07:00 |
| JieXin Liang | 97cb762bb6 | [misc] remove is_cuda_available (#5319) | 2025-04-20 18:16:51 -07:00 |
| Trevor Morris | 11d760d56a | FP4 weight loading and inference (2/2) (#3972) | 2025-04-08 17:26:21 -07:00 |
| Yun Dai | 2695ab0537 | Fix loading KV quantization scale; Enable modelopt kv cache (#4686) — Co-authored-by: qingquansong <ustcsqq@gmail.com> | 2025-04-08 09:11:35 -07:00 |
| Xiaoyu Zhang | 9b81f9bd34 | sglang quant module remove vllm dependency (#4507) | 2025-03-17 15:51:59 -07:00 |
| Lianmin Zheng | 45de89719c | Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) | 2025-03-12 23:45:52 -07:00 |
| Meng, Hengyu | 71046fcd71 | [XPU][CPU] Enable the native path of DeepSeek (#4086) — Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com> | 2025-03-12 22:26:29 -07:00 |
| HandH1998 | 0dd6cda288 | Apply sgl w8a8 fp8 kernel (#3148) | 2025-03-09 00:03:32 -08:00 |
| Lianmin Zheng | 935cda944b | Misc clean up; Remove the support of jump forward (#4032) | 2025-03-03 07:02:14 -08:00 |