Commit Graph

92 Commits

Author SHA1 Message Date
kk
5a144a8ab9 Fix run time error in ROCm platform (#5147) 2025-04-07 22:49:40 -07:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: root <root@dell300x-pla-t10-17.pla.dcgpu>
Hubert Lu
afb752bcbe [AMD] Fix missing per_token_group_quant_fp8 for ROCm (#5140) 2025-04-07 22:38:25 -07:00
Chang Su
f04c80dc42 Add Llama4 support (#5092) 2025-04-07 00:29:36 -07:00
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>
Xiaoyu Zhang
924ca7c92c Add DeepSeek V3/R1 shared experts fusion (#4918) 2025-04-04 01:59:29 -07:00
fzyzcjy
6ff9c6a5e7 Cleanup unused resources after DeepEP operation (#4996) 2025-04-04 00:36:04 -07:00
fzyzcjy
77e929a1a2 Support async DeepEP by splitting into two stages (#4995) 2025-04-04 00:32:27 -07:00
fzyzcjy
febe21ce03 Small refactor DeepEPDispatcher into subclasses (#4994) 2025-04-04 00:24:18 -07:00
Tommy Yang
31035dda44 Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (#5057) 2025-04-03 22:14:28 -07:00
AniZpZ
d95269f9b3 [2/3] fix dsv3 awq issue (#4625) 2025-04-03 17:36:39 -07:00
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
fzyzcjy
8e10fec9a8 Small refactor DeepEPMode to clean up code a bit (#4992) 2025-04-03 02:56:44 -07:00
saltyfish66
e41549c3d6 fix: fix illegal cuda memory access at fused_moe_kernel (#4727) 2025-04-03 00:07:32 -07:00
Co-authored-by: yuethe <yuethe@tencent.com>
Jinyan Chen
23c764b18a [Feature] Support DeepEP Low Latency (#4767) 2025-04-01 09:23:25 -07:00
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>
Qingquan Song
044c315970 Make torch compile configurable for biased_grouped_topk (#4749) 2025-03-28 10:57:52 -07:00
Lianmin Zheng
74e0ac1dbd Clean up import vllm in quantization/__init__.py (#4834) 2025-03-28 10:34:10 -07:00
Xiaoyu Zhang
04e3ff6975 Support compressed tensors fp8w8a8 (#4743) 2025-03-26 13:21:25 -07:00
yuhsaun-t
199bb01d00 Add endpoints to dump selected expert ids (#4435) 2025-03-24 21:34:19 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
fzyzcjy
ca75741e86 Support async in DeepEP (#4610) 2025-03-22 22:39:56 -07:00
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
fzyzcjy
c6d549e773 Multiple tiny code cleanups (#4608) 2025-03-22 22:39:11 -07:00
xutizhou
c2bd094d6e Optimize Permute Kernel in DeepEP (#4643) 2025-03-22 14:30:34 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Jinyan Chen
f44db16c8e [Feature] Integrate DeepEP into SGLang (#4232) 2025-03-19 08:16:31 -07:00
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Ke Bao
3ded4b215d Revert "feat: update grouped_topk to support softmax and sigmoid" (#4505) 2025-03-17 11:30:26 -07:00
Wenbo Yang
75b656488a Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418) 2025-03-17 00:03:43 -07:00
Mick
0f52fb55ec config: Update fused moe config (#4493) 2025-03-16 23:51:58 -07:00
Mick
8ec2ce0726 perf: update fused moe config (#4459) 2025-03-15 21:23:57 -07:00
Yineng Zhang
ad1ae7f7cd use topk_softmax with sgl-kernel (#4439) 2025-03-14 15:59:06 -07:00
Yineng Zhang
977d7cd26a cleanup deps 1/n (#4400) 2025-03-14 00:00:33 -07:00
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Lianmin Zheng
c6d7f8d370 Add some fused elementwise kernels for grok-1 (#4398) 2025-03-13 13:39:10 -07:00
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Cheng Wan
2f6bacee03 [moe] fix: correct the cache size in the last chunk (#3679) 2025-03-12 22:22:13 -07:00
Co-authored-by: Abatom <abzhonghua@gmail.com>
Stefan He
e0917e6bd0 Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215) 2025-03-12 00:08:03 -07:00
Co-authored-by: Stefan He <bhe@linkedin.com>
lambert0312
7140ba3573 Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4323) 2025-03-11 18:25:56 -07:00
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00
Ximingwang-09
0f2a2e3c19 Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) (#4220) 2025-03-11 12:32:33 -07:00
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
yych0745
6a02b32d07 Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4287) 2025-03-11 00:49:06 -07:00
Co-authored-by: HandH1998 <1335248067@qq.com>
lambert0312
d3ecd63204 Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) (#4136) 2025-03-11 00:32:25 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
HandH1998
c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888) 2025-03-06 20:54:52 -08:00
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
HAI
13bc39c5d6 ROCm: enable trillion-parameter MoE models with INT4-FP8 single node (#4152) 2025-03-06 15:33:02 -08:00
HAI
71ab0dabe0 Fix the moe padding conditional logic (#4081) 2025-03-05 10:56:51 -08:00
HAI
51d25405a7 ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant) (#4053) 2025-03-04 03:00:46 -08:00
kk
11eea69e70 Fix assert options.num_stages != 0 error in the latest ROCm build image (#4049) 2025-03-03 20:37:03 -08:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Lianmin Zheng
66301e124f Improve code styles (#4021) 2025-03-03 03:20:23 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988) 2025-03-03 00:12:04 -08:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
laixin
b0df5d240b Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3922) 2025-02-27 10:59:46 +00:00
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
lukec
21463e321a Expert Parallelism (EP) Support for DeepSeek V3/R1 (#3602) 2025-02-26 02:29:37 -08:00
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: laixin <q865809639@gmail.com>
laixin
1a6e97577a Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730) 2025-02-24 05:43:35 -08:00
Co-authored-by: HandH1998 <1335248067@qq.com>
zixuanzhang226
0c227ee373 feat: update grouped_topk to support softmax and sigmoid (#3680) 2025-02-21 16:30:15 +08:00
HAI
5c54ef0352 AMD/ROCm: update AITER repo to ROCm/aiter (#3747) 2025-02-21 00:18:08 -08:00
chenxiaobing
d5d80ab477 [Bugfix] Fix scores mask for moe topk (#3705) 2025-02-21 02:17:23 +08:00
Cheng Wan
6b0aeb58fd [moe] optim: reduce memory consumption in fused_moe (#3692) 2025-02-20 02:25:05 +08:00
yigex
ddf39d3fce [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567) 2025-02-17 17:54:10 -08:00