Commit     | Date                       | Author                 | Subject (sign-off/co-author trailers in brackets)
-----------|----------------------------|------------------------|--------------------------------------------------
9b8ebb2798 | 2025-10-09 16:46:15 -07:00 | Lianmin Zheng          | move more files under srt/utils (#11285)
3c06b673af | 2025-10-07 21:51:41 -07:00 | Cheng Wan              | [8/N] MoE Refactor: deprecate EPMoE (#11211)
cd4b39a900 | 2025-10-07 14:06:05 -07:00 | Bowen Bao              | [quantization] Properly ignore quantization for layers excluded in quant_config (#11205)
155cbb51f0 | 2025-10-06 13:24:15 -07:00 | Zhiyu                  | Enable native ModelOpt quantization support (1/3) (#7149) [Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>]
baee08601b | 2025-10-05 19:51:34 -07:00 | Bowen Bao              | [quantization] Enable aiter mxfp4 fused_moe for Quark (#10048) [Co-authored-by: HaiShaw <hixiao@gmail.com>]
5e786cca3a | 2025-10-02 18:04:36 +08:00 | fzyzcjy                | Support single batch overlap (#10422)
0b9dfba787 | 2025-10-02 18:02:19 +08:00 | fzyzcjy                | Support dispatch low latency (#10263) [Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>]
8831c55c3d | 2025-09-30 11:26:17 +08:00 | DevashishLal-CB        | [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (#9642)
8ebf72fef3 | 2025-09-26 22:12:22 -07:00 | kk                     | [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (#10981) [Co-authored-by: wunhuang <wunhuang@amd.com>]
8260574729 | 2025-09-27 05:05:23 +00:00 | Yueyang Pan            | fix: fp8 quantization failure of qwen 2.5 VL 7B model (#10112) [Signed-off-by: PanJason <pyyjason@gmail.com>]
172bcf0152 | 2025-09-25 20:14:15 -07:00 | Yineng Zhang           | Revert "Refactor kv_cache_scheme handling for quantization (#10132)" (#10935)
cd4da1f19b | 2025-09-25 10:32:15 -07:00 | Mohammad Miadh Angkad  | Refactor kv_cache_scheme handling for quantization (#10132)
c4e314f986 | 2025-09-25 07:45:25 +08:00 | Xiaoyu Zhang           | Restruct sgl-kernel benchmark (#10861)
f47a2c67e6 | 2025-09-23 16:48:12 -07:00 | Lianmin Zheng          | [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (#10825) [Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>]
113f8f65a2 | 2025-09-22 17:34:34 -07:00 | Lianmin Zheng          | [Auto Sync] Update configurer.py (20250923) (#10765) [Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>]
e22f3a5ec9 | 2025-09-22 17:18:36 -07:00 | ronnie_zheng           | [Ascend] optimize Qwen3 on Ascend (#10574) [Co-authored-by: c30031083 <chenxu140@huawei.com>]
f1d7892318 | 2025-09-20 02:37:49 -07:00 | Yineng Zhang           | [Auto Sync] Update modelopt_quant.py (20250920) (#10688) [Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>]
388c05d544 | 2025-09-18 11:44:43 -07:00 | yhyang201              | Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (#10579)
c32fb7a24d | 2025-09-17 17:44:59 -07:00 | sogalin                | [ROCm] Fix fp8 quantization accuracy issue. (#10558)
b2435be682 | 2025-09-15 22:30:28 -07:00 | b8zhong                | Cache the result of is_blackwell platform check (#10498)
a220c14f81 | 2025-09-16 09:45:15 +08:00 | scut-cbq               | fix crash of DeepSeek-V3 update_weights_from_disk (#8863) [Co-authored-by: parkeychen <parkeychen@tencent.com>]
2f8ba6fe82 | 2025-09-14 17:34:28 -07:00 | Cheng Wan              | [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path (#10429)
ac964d2e58 | 2025-09-14 01:17:00 -07:00 | fzyzcjy                | Support global scale in addition to per expert scale for cutedsl moe (#10270)
fa46e2bd40 | 2025-09-14 01:14:28 -07:00 | fzyzcjy                | Support offloading in fp8 (#9948)
2df532ef20 | 2025-09-14 01:07:47 -07:00 | fzyzcjy                | Fix the global scale fix does not support EPLB and improve enabling condition (#10369)
efedbe6ca9 | 2025-09-12 03:22:49 -07:00 | fzyzcjy                | Fix global input scale incompatible with CuTe DSL moe (#10370)
c7e85f5378 | 2025-09-11 20:19:17 -07:00 | Trevor Morris          | fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296)
3df05f4d6a | 2025-09-11 20:18:43 -07:00 | Shu Wang               | [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199)
6d55f60e77 | 2025-09-10 18:24:23 -07:00 | Yineng Zhang           | Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292)
21176b0093 | 2025-09-10 12:00:23 -07:00 | Pavani Majety          | [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940) [Signed-off-by: Pavani Majety <pmajety@nvidia.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>; Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>]
45b3a6a256 | 2025-09-08 11:28:15 -07:00 | Yineng Zhang           | Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176)
b7d1f17b8d | 2025-09-07 22:31:11 -07:00 | Yineng Zhang           | Revert "enable auto-round quantization model (#6226)" (#10148)
c8295d2353 | 2025-09-07 22:05:35 -07:00 | Weiwei                 | enable auto-round quantization model (#6226) [Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>]
b67c277f86 | 2025-09-07 21:50:49 -07:00 | Even Zhou              | [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013)
7577f0e40f | 2025-09-07 21:33:58 -07:00 | Cao E                  | Add graph runner support with torch compile on CPU (#7843)
7802586cab | 2025-09-07 20:28:14 -07:00 | Rain Jiang             | fix the fp8 topk_config.correction_bias is none bug (#10040)
5a7e10fe4c | 2025-09-07 19:43:59 -07:00 | Cheng Wan              | [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144)
a5a03209e9 | 2025-09-06 01:34:17 -07:00 | Cheng Wan              | Fix circular import (#10107)
9a719b7afc | 2025-09-05 22:41:22 -07:00 | Kaixi Hou              | [NVIDIA] Remove unused get_fused_moe_impl_class function (#9764)
3fa62da78c | 2025-09-05 21:09:09 -07:00 | Cheng Wan              | [7/N] MoE Refactor: the implementation of new framework (#9269)
4efe844a25 | 2025-09-05 12:54:40 -07:00 | Morpheus Guo           | enable aiter gemm_a8w8_bpreshuffle for ptpc gemm (#8555)
5e5c30d9ab | 2025-09-05 19:52:32 +08:00 | fzyzcjy                | Tiny let DeepGEMM scale checks cover more cases (#7182) [Co-authored-by: Yineng Zhang <me@zhyncs.com>]
339f8eef09 | 2025-09-05 18:45:08 +08:00 | fzyzcjy                | [1/2] Optimizations and refactors about quant kernel (#9534)
e96973742c | 2025-09-04 15:11:22 -07:00 | kk                     | Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008) [Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: HAI <hixiao@gmail.com>; Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>]
2c562fd2d0 | 2025-09-04 00:48:58 -07:00 | Hubert Lu              | Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993)
1b2ff4fb7f | 2025-09-03 00:50:04 -07:00 | Yineng Zhang           | Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959)
2c7ca33abb | 2025-09-02 23:49:56 -07:00 | Yineng Zhang           | Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" (#9955)
0dfd54d11d | 2025-09-02 22:26:28 -07:00 | kk                     | Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671) [Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: wghuang <wghuang@amd.com>]
bcbeed714f | 2025-09-02 20:56:03 -07:00 | jingyu-ml              | Qwen FP8/NVFP4 ModelOPT Quantization support (#7912) [Co-authored-by: Jingyu Xin <jingyux@nvidia.com>]
6243c36702 | 2025-09-02 19:31:15 -07:00 | Al-Ekram Elahee Hridoy | [Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)