Commit Graph

63 Commits

Author SHA1 Message Date
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
laixin
b0df5d240b Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3922)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
2025-02-27 10:59:46 +00:00
laixin
1a6e97577a Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-02-24 05:43:35 -08:00
Zhiyu
c66b2c9cf1 Add support for nvidia modelopt fp8 kv cache (#3223) 2025-02-22 07:04:58 +08:00
HAI
5c54ef0352 AMD/ROCm: update AITER repo to ROCm/aiter (#3747) 2025-02-21 00:18:08 -08:00
yizhang2077
1eb8eade2b add control for cutlass fp8 blockwise gemm (#3727) 2025-02-20 16:10:35 +08:00
Wen-Heng (Jack) Chung
2eab113206 [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. (#3616)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-17 15:54:18 -08:00
Xiaoyu Zhang
c38f3aed24 support multi-gpu block-gemm tuning (#3639) 2025-02-18 00:00:35 +08:00
Yineng Zhang
5f1a485d9e Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… (#3632) 2025-02-17 18:01:21 +08:00
Wen-Heng (Jack) Chung
03caefeb51 [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. (#3535)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-16 01:40:38 -08:00
Wen-Heng (Jack) Chung
871a4aa1bf [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. (#3536) 2025-02-12 20:09:36 -08:00
yizhang2077
98eecbda54 integrate blockwise fp8 kernel (#3529) 2025-02-13 04:39:33 +08:00
Liangsheng Yin
8616357a97 Fix deepseek awq v3 (#3450) 2025-02-12 22:09:52 +08:00
Xiaoyu Zhang
45e3a7bc41 use sgl_per_token_group_quant_fp8 kernel (#3493) 2025-02-12 18:40:42 +08:00
yigex
fdf04a1426 [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418)
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-10 23:55:04 -08:00
lukec
4530136e61 Add H20 fp8 w8a8 gemm config (#3386) 2025-02-08 15:36:31 +08:00
Wen-Heng (Jack) Chung
32de54ed1a [ROCm] Fix fp8 unrolledx4 matmul kernel. (#3325)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-05 18:44:15 -08:00
Wen-Heng (Jack) Chung
7ab84948d8 [ROCm] Logic to decide whether to used manually unrolled kernel. (#3306) 2025-02-04 19:12:20 -08:00
Wen-Heng (Jack) Chung
c2723a42a5 [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. (#3299) 2025-02-05 07:15:40 +08:00
Wen-Heng (Jack) Chung
c7256ca836 [ROCm] Add tuning configs for AMD Radeon Graphics. (#3294) 2025-02-04 10:34:57 -08:00
kushanam
d54cee1441 adding Triton configs for DeepSeekV3 on Blackwell (#3272) 2025-02-04 04:12:09 +08:00
Ke Bao
c02e313914 Fix block wise fp8 torch compile (#3232) 2025-01-31 19:56:02 +08:00
yigex
351a72d40b add dsv3 mi300 triton config for block scale (#3146) 2025-01-27 17:25:53 +08:00
Lianmin Zheng
52c03f16b9 Add activation parameters to fused_moe (#3170) 2025-01-27 00:23:37 -08:00
Ke Bao
656dcc1a99 Remove fp8 monkey patch (#2960) 2025-01-18 15:00:29 +08:00
Yineng Zhang
5dc54f1a62 feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
2025-01-17 22:31:51 +08:00
Yineng Zhang
bf8d07a6f9 feat: patch linear base (#2915) 2025-01-16 18:00:03 +08:00
Ke Bao
cc0485bef2 Support w8a8 int8 quantization config (#2881) 2025-01-14 17:07:49 +08:00
Lianmin Zheng
6249e4a19e Revert "Integration of TurboMind AWQ" (#2866) 2025-01-13 04:44:39 -08:00
Ke Bao
f3516c2894 Fix quant kernel accuracy issue (#2865) 2025-01-13 20:32:17 +08:00
bjmsong
17de02f98d Integration of TurboMind AWQ (#2828)
Co-authored-by: root <bjmsong@126.com>
2025-01-13 20:14:16 +08:00
kk
42f3909963 Unify sglang coding style (#2856)
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 02:12:44 -08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
kk
e808c1df3e Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 08:23:07 +00:00
Ke Bao
85b2e05770 Add int8 quant kernel (#2848) 2025-01-13 13:16:58 +08:00
Ke Bao
b5fb4ef58a Update modelopt config and fix running issue (#2792) 2025-01-08 18:04:30 +08:00
Lianmin Zheng
8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-07 23:29:10 -08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
Zhiyu
287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) 2025-01-06 14:54:52 -08:00
HAI
e6f523b5f2 fix typo in python/sglang/srt/layers/quantization/fp8.py (#2655) 2024-12-29 23:45:02 -08:00
HandH1998
afa0341e57 Update Triton configs for block fp8 kernels (#2641) 2024-12-29 22:53:47 +08:00
HAI
30828e7192 AMD: set weights and scaling numbers properly for block FP8 (#2637) 2024-12-29 03:23:39 -08:00
Yineng Zhang
7863e4368a add configs for block fp8 related kernels (#2628)
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-28 23:12:04 +08:00
Xiaoyu Zhang
9254a33ad4 avoid fused_moe_triton padding circular import (#2624) 2024-12-28 14:01:35 +08:00
Yineng Zhang
635a042623 docs: update deepseek v3 example (#2592) 2024-12-26 17:43:37 +08:00
HandH1998
53aed988cb Refactor MoE (#2575)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-12-26 00:02:14 +08:00
Ke Bao
e835a50021 Reorg moe code (#2563) 2024-12-24 01:10:22 +08:00
HAI
95f93f493a Fp8 MoE optimizations on AMD (#2388) 2024-12-07 21:18:26 +08:00
Yineng Zhang
d332aa3b0c fix: resolve fp8 moe issue (#2387) 2024-12-07 19:28:53 +08:00
Yineng Zhang
84d96b3ae5 Move FP8 to SGLang (#2370)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2024-12-06 15:42:10 +08:00