Commit Graph

63 Commits

Author SHA1 Message Date
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
laixin
b0df5d240b Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3922)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
2025-02-27 10:59:46 +00:00
laixin
1a6e97577a Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-02-24 05:43:35 -08:00
Zhiyu
c66b2c9cf1 Add support for nvidia modelopt fp8 kv cache (#3223) 2025-02-22 07:04:58 +08:00
HAI
5c54ef0352 AMD/ROCm: update AITER repo to ROCm/aiter (#3747) 2025-02-21 00:18:08 -08:00
yizhang2077
1eb8eade2b add control for cutlass fp8 blockwise gemm (#3727) 2025-02-20 16:10:35 +08:00
Wen-Heng (Jack) Chung
2eab113206 [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. (#3616)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-17 15:54:18 -08:00
Xiaoyu Zhang
c38f3aed24 support multi-gpu block-gemm tuning (#3639) 2025-02-18 00:00:35 +08:00
Yineng Zhang
5f1a485d9e Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… (#3632) 2025-02-17 18:01:21 +08:00
Wen-Heng (Jack) Chung
03caefeb51 [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. (#3535)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-16 01:40:38 -08:00
Wen-Heng (Jack) Chung
871a4aa1bf [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. (#3536) 2025-02-12 20:09:36 -08:00
yizhang2077
98eecbda54 integrate blockwise fp8 kernel (#3529) 2025-02-13 04:39:33 +08:00
Liangsheng Yin
8616357a97 Fix deepseek awq v3 (#3450) 2025-02-12 22:09:52 +08:00
Xiaoyu Zhang
45e3a7bc41 use sgl_per_token_group_quant_fp8 kernel (#3493) 2025-02-12 18:40:42 +08:00
yigex
fdf04a1426 [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418)
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-10 23:55:04 -08:00
lukec
4530136e61 Add H20 fp8 w8a8 gemm config (#3386) 2025-02-08 15:36:31 +08:00
Wen-Heng (Jack) Chung
32de54ed1a [ROCm] Fix fp8 unrolledx4 matmul kernel. (#3325)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-05 18:44:15 -08:00
Wen-Heng (Jack) Chung
7ab84948d8 [ROCm] Logic to decide whether to used manually unrolled kernel. (#3306) 2025-02-04 19:12:20 -08:00
Wen-Heng (Jack) Chung
c2723a42a5 [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. (#3299) 2025-02-05 07:15:40 +08:00
Wen-Heng (Jack) Chung
c7256ca836 [ROCm] Add tuning configs for AMD Radeon Graphics. (#3294) 2025-02-04 10:34:57 -08:00
kushanam
d54cee1441 adding Triton configs for DeepSeekV3 on Blackwell (#3272) 2025-02-04 04:12:09 +08:00
Ke Bao
c02e313914 Fix block wise fp8 torch compile (#3232) 2025-01-31 19:56:02 +08:00
yigex
351a72d40b add dsv3 mi300 triton config for block scale (#3146) 2025-01-27 17:25:53 +08:00
Lianmin Zheng
52c03f16b9 Add activation parameters to fused_moe (#3170) 2025-01-27 00:23:37 -08:00
Ke Bao
656dcc1a99 Remove fp8 monkey patch (#2960) 2025-01-18 15:00:29 +08:00
Yineng Zhang
5dc54f1a62 feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
2025-01-17 22:31:51 +08:00
Yineng Zhang
bf8d07a6f9 feat: patch linear base (#2915) 2025-01-16 18:00:03 +08:00
Ke Bao
cc0485bef2 Support w8a8 int8 quantization config (#2881) 2025-01-14 17:07:49 +08:00
Lianmin Zheng
6249e4a19e Revert "Integration of TurboMind AWQ" (#2866) 2025-01-13 04:44:39 -08:00
Ke Bao
f3516c2894 Fix quant kernel accuracy issue (#2865) 2025-01-13 20:32:17 +08:00
bjmsong
17de02f98d Integration of TurboMind AWQ (#2828)
Co-authored-by: root <bjmsong@126.com>
2025-01-13 20:14:16 +08:00
kk
42f3909963 Unify sglang coding style (#2856)
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 02:12:44 -08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
kk
e808c1df3e Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 08:23:07 +00:00
Ke Bao
85b2e05770 Add int8 quant kernel (#2848) 2025-01-13 13:16:58 +08:00
Ke Bao
b5fb4ef58a Update modelopt config and fix running issue (#2792) 2025-01-08 18:04:30 +08:00
Lianmin Zheng
8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-07 23:29:10 -08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
Zhiyu
287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) 2025-01-06 14:54:52 -08:00
HAI
e6f523b5f2 fix typo in python/sglang/srt/layers/quantization/fp8.py (#2655) 2024-12-29 23:45:02 -08:00
HandH1998
afa0341e57 Update Triton configs for block fp8 kernels (#2641) 2024-12-29 22:53:47 +08:00
HAI
30828e7192 AMD: set weights and scaling numbers properly for block FP8 (#2637) 2024-12-29 03:23:39 -08:00
Yineng Zhang
7863e4368a add configs for block fp8 related kernels (#2628)
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-28 23:12:04 +08:00
Xiaoyu Zhang
9254a33ad4 avoid fused_moe_triton padding circular import (#2624) 2024-12-28 14:01:35 +08:00
Yineng Zhang
635a042623 docs: update deepseek v3 example (#2592) 2024-12-26 17:43:37 +08:00
HandH1998
53aed988cb Refactor MoE (#2575)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-12-26 00:02:14 +08:00
Ke Bao
e835a50021 Reorg moe code (#2563) 2024-12-24 01:10:22 +08:00
HAI
95f93f493a Fp8 MoE optimizations on AMD (#2388) 2024-12-07 21:18:26 +08:00
Yineng Zhang
d332aa3b0c fix: resolve fp8 moe issue (#2387) 2024-12-07 19:28:53 +08:00
Yineng Zhang
84d96b3ae5 Move FP8 to SGLang (#2370)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2024-12-06 15:42:10 +08:00