Commit Graph

334 Commits

9b8ebb2798  move more files under srt/utils (#11285)
            Lianmin Zheng, 2025-10-09 16:46:15 -07:00
3c06b673af  [8/N] MoE Refactor: deprecate EPMoE (#11211)
            Cheng Wan, 2025-10-07 21:51:41 -07:00
cd4b39a900  [quantization] Properly ignore quantization for layers excluded in quant_config (#11205)
            Bowen Bao, 2025-10-07 14:06:05 -07:00
155cbb51f0  Enable native ModelOpt quantization support (1/3) (#7149)
            Zhiyu, 2025-10-06 13:24:15 -07:00
            Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
baee08601b  [quantization] Enable aiter mxfp4 fused_moe for Quark (#10048)
            Bowen Bao, 2025-10-05 19:51:34 -07:00
            Co-authored-by: HaiShaw <hixiao@gmail.com>
5e786cca3a  Support single batch overlap (#10422)
            fzyzcjy, 2025-10-02 18:04:36 +08:00
0b9dfba787  Support dispatch low latency (#10263)
            fzyzcjy, 2025-10-02 18:02:19 +08:00
            Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>
8831c55c3d  [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (#9642)
            DevashishLal-CB, 2025-09-30 11:26:17 +08:00
8ebf72fef3  [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (#10981)
            kk, 2025-09-26 22:12:22 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
8260574729  fix: fp8 quantization failure of qwen 2.5 VL 7B model (#10112)
            Yueyang Pan, 2025-09-27 05:05:23 +00:00
            Signed-off-by: PanJason <pyyjason@gmail.com>
172bcf0152  Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935)
            Yineng Zhang, 2025-09-25 20:14:15 -07:00
cd4da1f19b  Refactor kv_cache_scheme handling for quantization (#10132)
            Mohammad Miadh Angkad, 2025-09-25 10:32:15 -07:00
c4e314f986  Restruct sgl-kernel benchmark (#10861)
            Xiaoyu Zhang, 2025-09-25 07:45:25 +08:00
f47a2c67e6  [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (#10825)
            Lianmin Zheng, 2025-09-23 16:48:12 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
113f8f65a2  [Auto Sync] Update configurer.py (20250923) (#10765)
            Lianmin Zheng, 2025-09-22 17:34:34 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
e22f3a5ec9  [Ascend] optimize Qwen3 on Ascend (#10574)
            ronnie_zheng, 2025-09-22 17:18:36 -07:00
            Co-authored-by: c30031083 <chenxu140@huawei.com>
f1d7892318  [Auto Sync] Update modelopt_quant.py (20250920) (#10688)
            Yineng Zhang, 2025-09-20 02:37:49 -07:00
            Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
388c05d544  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (#10579)
            yhyang201, 2025-09-18 11:44:43 -07:00
c32fb7a24d  [ROCm] Fix fp8 quantization accuracy issue. (#10558)
            sogalin, 2025-09-17 17:44:59 -07:00
b2435be682  Cache the result of is_blackwell platform check (#10498)
            b8zhong, 2025-09-15 22:30:28 -07:00
a220c14f81  fix crash of DeepSeek-V3 update_weights_from_disk (#8863)
            scut-cbq, 2025-09-16 09:45:15 +08:00
            Co-authored-by: parkeychen <parkeychen@tencent.com>
2f8ba6fe82  [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path (#10429)
            Cheng Wan, 2025-09-14 17:34:28 -07:00
ac964d2e58  Support global scale in addition to per expert scale for cutedsl moe (#10270)
            fzyzcjy, 2025-09-14 01:17:00 -07:00
fa46e2bd40  Support offloading in fp8 (#9948)
            fzyzcjy, 2025-09-14 01:14:28 -07:00
2df532ef20  Fix the global scale fix does not support EPLB and improve enabling condition (#10369)
            fzyzcjy, 2025-09-14 01:07:47 -07:00
efedbe6ca9  Fix global input scale incompatible with CuTe DSL moe (#10370)
            fzyzcjy, 2025-09-12 03:22:49 -07:00
c7e85f5378  fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296)
            Trevor Morris, 2025-09-11 20:19:17 -07:00
3df05f4d6a  [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199)
            Shu Wang, 2025-09-11 20:18:43 -07:00
6d55f60e77  Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292)
            Yineng Zhang, 2025-09-10 18:24:23 -07:00
21176b0093  [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940)
            Pavani Majety, 2025-09-10 12:00:23 -07:00
            Signed-off-by: Pavani Majety <pmajety@nvidia.com>
            Co-authored-by: Yineng Zhang <me@zhyncs.com>
            Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
45b3a6a256  Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176)
            Yineng Zhang, 2025-09-08 11:28:15 -07:00
b7d1f17b8d  Revert "enable auto-round quantization model (#6226)" (#10148)
            Yineng Zhang, 2025-09-07 22:31:11 -07:00
c8295d2353  enable auto-round quantization model (#6226)
            Weiwei, 2025-09-07 22:05:35 -07:00
            Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
b67c277f86  [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013)
            Even Zhou, 2025-09-07 21:50:49 -07:00
7577f0e40f  Add graph runner support with torch compile on CPU (#7843)
            Cao E, 2025-09-07 21:33:58 -07:00
7802586cab  fix the fp8 topk_config.correction_bias is none bug (#10040)
            Rain Jiang, 2025-09-07 20:28:14 -07:00
5a7e10fe4c  [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144)
            Cheng Wan, 2025-09-07 19:43:59 -07:00
a5a03209e9  Fix circular import (#10107)
            Cheng Wan, 2025-09-06 01:34:17 -07:00
9a719b7afc  [NVIDIA] Remove unused get_fused_moe_impl_class function (#9764)
            Kaixi Hou, 2025-09-05 22:41:22 -07:00
3fa62da78c  [7/N] MoE Refactor: the implementation of new framework (#9269)
            Cheng Wan, 2025-09-05 21:09:09 -07:00
4efe844a25  enable aiter gemm_a8w8_bpreshuffle for ptpc gemm (#8555)
            Morpheus Guo, 2025-09-05 12:54:40 -07:00
5e5c30d9ab  Tiny let DeepGEMM scale checks cover more cases (#7182)
            fzyzcjy, 2025-09-05 19:52:32 +08:00
            Co-authored-by: Yineng Zhang <me@zhyncs.com>
339f8eef09  [1/2] Optimizations and refactors about quant kernel (#9534)
            fzyzcjy, 2025-09-05 18:45:08 +08:00
e96973742c  Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008)
            kk, 2025-09-04 15:11:22 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
            Co-authored-by: HAI <hixiao@gmail.com>
            Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
2c562fd2d0  Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993)
            Hubert Lu, 2025-09-04 00:48:58 -07:00
1b2ff4fb7f  Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959)
            Yineng Zhang, 2025-09-03 00:50:04 -07:00
2c7ca33abb  Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" (#9955)
            Yineng Zhang, 2025-09-02 23:49:56 -07:00
0dfd54d11d  Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)
            kk, 2025-09-02 22:26:28 -07:00
            Co-authored-by: wunhuang <wunhuang@amd.com>
            Co-authored-by: wghuang <wghuang@amd.com>
bcbeed714f  Qwen FP8/NVFP4 ModelOPT Quantization support (#7912)
            jingyu-ml, 2025-09-02 20:56:03 -07:00
            Co-authored-by: Jingyu Xin <jingyux@nvidia.com>
6243c36702  [Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)
            Al-Ekram Elahee Hridoy, 2025-09-02 19:31:15 -07:00