Commit Graph

116 Commits

Author SHA1 Message Date
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Baizhou Zhang
f6772f1497 [Fix] Turn off DeepGEMM by default (#5263) 2025-04-14 17:45:44 -07:00
JieXin Liang
bdde237562 [perf] experimental enhance fp8 per-tensor quant (#5370) 2025-04-14 12:35:43 -07:00
Yineng Zhang
f58b929a51 chore: upgrade sgl-kernel 0.0.8.post3 (#5342) 2025-04-13 00:45:59 -07:00
Yineng Zhang
611720919d fix: use deepgemm only on hopper (#5310) 2025-04-11 20:48:24 -07:00
HAI
8879944800 ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (#5228) 2025-04-10 18:19:57 -07:00
HandH1998
4065248214 Support Llama4 fp8 inference (#5194) 2025-04-09 20:14:34 +08:00
    Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
    Co-authored-by: zhyncs <me@zhyncs.com>
kk
92823069c4 Fix ci test "test_eval_fp8_accuracy" failed (#5185) 2025-04-09 02:44:05 -07:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
Yineng Zhang
6669d12707 feat: add DeepGEMM build warning (#5176) 2025-04-08 21:16:23 -07:00
    Co-authored-by: grimoire <streetyao@live.com>
Trevor Morris
11d760d56a FP4 weight loading and inference (2/2) (#3972) 2025-04-08 17:26:21 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686) 2025-04-08 09:11:35 -07:00
    Co-authored-by: qingquansong <ustcsqq@gmail.com>
kk
88d6fd9a11 Fix torch compile errors (#5158) 2025-04-08 15:04:37 +00:00
Yubo Wang
804d9f2e4c Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) 2025-04-07 23:20:51 -07:00
kk
5a144a8ab9 Fix run time error in ROCm platform (#5147) 2025-04-07 22:49:40 -07:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
    Co-authored-by: root <root@dell300x-pla-t10-17.pla.dcgpu>
Xiaoyu Zhang
db452760e5 [ci] fix llama4 ci error (#5126) 2025-04-07 21:15:46 +08:00
HAI
819924748a Fix refactor error - fp8.py (#5106) 2025-04-07 00:34:08 -07:00
    Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Chang Su
f04c80dc42 Add Llama4 support (#5092) 2025-04-07 00:29:36 -07:00
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>
    Co-authored-by: fzyzcjy <ch271828n@outlook.com>
    Co-authored-by: ispobock <ispobaoke@163.com>
Xiaoyu Zhang
924ca7c92c Add DeepSeek V3/R1 shared experts fusion (#4918) 2025-04-04 01:59:29 -07:00
AniZpZ
d95269f9b3 [2/3] fix dsv3 awq issue (#4625) 2025-04-03 17:36:39 -07:00
    Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
    Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Xiaoyu Zhang
e9c6ce461d sgl scaled_fp8_quant support output padding (#4861) 2025-04-02 23:53:57 +08:00
Lianmin Zheng
74e0ac1dbd Clean up import vllm in quantization/__init__.py (#4834) 2025-03-28 10:34:10 -07:00
Jiaqi
72031173e4 fix: fix typo of comments in w8a8_fp8.py (#4843) 2025-03-27 21:06:47 -07:00
laixin
ae25d36dc6 [3/3] fix dsv3 awq issue (#4719) 2025-03-26 23:13:43 -07:00
    Co-authored-by: AniZpZ <aniz1905@gmail.com>
Xiaoyu Zhang
04e3ff6975 Support compressed tensors fp8w8a8 (#4743) 2025-03-26 13:21:25 -07:00
Stefan He
4c584fc632 Fix circular imports in gptq.py and unblock test explorer (#4736) 2025-03-24 18:07:08 -07:00
Yun Dai
8cd4250401 [quantization] fix channelwise conversion with scalar weight scale (#4596) 2025-03-22 00:47:52 -07:00
lukec
4c56e5dbee Set deepgemm to the default value in the hopper architecture. (#4613) 2025-03-20 22:03:00 -07:00
Cheng Wan
7b5fc71972 fix SUPPORT_CUTLASS_BLOCK_FP8 flag (#4640) 2025-03-20 21:45:07 -07:00
strgrb
f9c53cbb42 Create col-major and tma-aligned x_scale for deep_gemm.gemm_fp8_fp8_bf16_nt (#4515) 2025-03-19 00:02:43 -07:00
    Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Yineng Zhang
c16b33ccac cleanup deps 3/n (#4541) 2025-03-18 00:11:36 -07:00
Xiaoyu Zhang
dd865befde [Hotfix] solve fp8 w8a8 ci test fail (#4531) 2025-03-17 23:17:04 -07:00
Xiaoyu Zhang
9b81f9bd34 sglang quant module remove vllm dependency (#4507) 2025-03-17 15:51:59 -07:00
yiakwy-xpu-ml-framework-team
5f9b2c62ff [ROCm] fix dtype (#4510) 2025-03-17 05:20:50 -07:00
Stefan He
ef3c2dd08e Support Online Quantization for W8A8 (#4485) 2025-03-17 00:28:56 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086) 2025-03-12 22:26:29 -07:00
    Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
AniZpZ
85ef7f64e4 [FIX] fix incorrect output when enable both deepgemm and torch compile (#4359) 2025-03-12 21:34:09 -07:00
    Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00
Ximingwang-09
0f2a2e3c19 Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) (#4220) 2025-03-11 12:32:33 -07:00
    Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
lukec
dce303e279 linear support deepgemm (#4199) 2025-03-11 00:38:37 -07:00
    Co-authored-by: yinfan98 <1106310035@qq.com>
lambert0312
d3ecd63204 Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) (#4136) 2025-03-11 00:32:25 -07:00
HandH1998
2ac189edc8 Amd test fp8 (#4261) 2025-03-10 10:12:09 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
Lianmin Zheng
e8a69e4d0c Clean up fp8 support (#4230) 2025-03-09 21:46:35 -07:00
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
HandH1998
c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888) 2025-03-06 20:54:52 -08:00
    Co-authored-by: yych0745 <1398089567@qq.com>
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
    Co-authored-by: b0urnee <2769086541@qq.com>
HAI
13bc39c5d6 ROCm: enable trillion-parameter MoE models with INT4-FP8 single node (#4152) 2025-03-06 15:33:02 -08:00
yigex
5be8f1ed98 ROCM: AITER BLOCK GEMM (#4075) 2025-03-05 03:10:49 -08:00
Qubitium-ModelCloud
56a724eba3 [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790) 2025-03-05 01:11:00 -08:00
    Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
    Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>