Commit Graph

23 Commits

Author SHA1 Message Date
Liangsheng Yin
516738b096 Deprecate global_server_args_dict (#11528) 2025-10-13 19:34:43 +08:00
Cheng Wan
1bdd010291 Revert "Deprecate global_server_args_dict" (#11520) 2025-10-12 17:40:40 -07:00
Liangsheng Yin
1083e7e3df Deprecate global_server_args_dict (#11331) 2025-10-13 01:20:47 +08:00
kk
8ebf72fef3 [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (#10981) 2025-09-26 22:12:22 -07:00
  Co-authored-by: wunhuang <wunhuang@amd.com>
yhyang201
388c05d544 Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (#10579) 2025-09-18 11:44:43 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00
Hubert Lu
2c562fd2d0 Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993) 2025-09-04 00:48:58 -07:00
Lianmin Zheng
4aeba40d7b [Sync] Update mxfp4.py (20250827) (#9724) 2025-08-27 17:00:09 -07:00
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Shiyang Chen <shiyang@x.ai>
Lianmin Zheng
fd71b11b1d move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679) 2025-08-27 03:34:29 -07:00
Stefan He
a530b3ffdc [RL] fix register the same ops multiple times (#9564) 2025-08-26 16:24:44 -07:00
hlu1
ccd3fb946e [fix] Fix mxfp4 triton MoE tp bug (#9473) 2025-08-23 01:48:40 -07:00
  Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
fzyzcjy
0374304a2c Add enable_flashinfer_mxfp4_bf16_moe for higher precision and slower moe backend (#9004) 2025-08-23 15:38:40 +08:00
kk
1c1f8a118e Combine fp4.py and mxfp4.py into one file and support dynamic mxfp4 quantization in mxfp4.py (#9049) 2025-08-16 19:01:54 -07:00
  Co-authored-by: wunhuang <wunhuang@amd.com>
Cheng Wan
84b006b278 Cleanup MoE Refactor (#9223) 2025-08-15 02:28:33 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) 2025-08-14 21:14:53 -07:00
Xiaoyu Zhang
63d82a776a refine mxfp4 shuffling log (#9194) 2025-08-14 10:57:29 -07:00
fzyzcjy
5190ba7f42 Fuse two kernels of hidden states padding into quantization kernel (#9005) 2025-08-12 01:20:13 -07:00
  Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Cheng Wan
1d24db8348 Expert Parallelism for GPT-OSS (#8944) 2025-08-08 00:46:42 -07:00
Xiaoyu Zhang
0d1e27a0c5 Better optimization log for gpt-oss model (#8953) 2025-08-08 00:11:48 -07:00
Xiaoyu Zhang
3ae33fcd0a Fix hopper launch gpt-oss model illegal memory (#8908) 2025-08-07 10:02:40 -07:00
Xiaoyu Zhang
47824c1488 [Perf] Auto enable best flashinfer mxfp4 kernel in b200 (#8898) 2025-08-07 01:08:41 -07:00
Xiaoyu Zhang
4373df5525 add flashinfer mxfp4 (#8847) 2025-08-06 16:23:41 -07:00
Ying Sheng
168033d5fb Support mxfp4 for GPT-OSS (#8843) 2025-08-06 00:05:25 -07:00
  Co-authored-by: fzyzcjy <ch271828n@outlook.com>
  Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
  Co-authored-by: zhuofan1123 <zhuofanl@nvidia.com>
  Co-authored-by: liz-badada <jinyanc@nvidia.com>
  Co-authored-by: xutizhou <xutingz@nvidia.com>
  Co-authored-by: linhu-nv <linhu@nvidia.com>