sglang

Author	SHA1	Message	Date
Yineng Zhang	c1dd773c19	fix: use fa3 unit test on hopper only (#5304)	2025-04-11 15:10:49 -07:00
Yineng Zhang	6f8593799b	feat: add blackwell workflow (#5303)	2025-04-11 13:42:00 -07:00
Yineng Zhang	f774a0d275	feat: add blackwell Dockerfile (#5302)	2025-04-11 13:08:53 -07:00
Xiaoyu Zhang	60bcbf2a35	remove moe_align_block_size torch.zeros in small batch/expert mode (#5298)	2025-04-11 12:13:55 -07:00
Adarsh Shirawalmath	a0a9f6d64f	[Docs] Remove the older supported docs section (#5301)	2025-04-11 11:30:18 -07:00
Yineng Zhang	80aa8ca84e	fix: update update_wheel_index for cu128 (#5300)	2025-04-11 09:31:03 -07:00
Adarsh Shirawalmath	4aa6bab0b0	[Docs] Supported Model Docs - Major restructuring (#5290) Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>	2025-04-11 09:17:47 -07:00
Yusong Gao	c35dcfdb30	[PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (#5292)	2025-04-11 23:03:07 +08:00
Yineng Zhang	c163bf4ff1	chore: bump sgl-kernel v0.0.8.post1 (#5289)	2025-04-11 02:11:53 -07:00
Yineng Zhang	5598634326	chore: relax the torch version restriction for sgl-kernel compilation (#5288)	2025-04-11 02:05:53 -07:00
Yineng Zhang	b75275b6f2	feat: add cu128 identifier for sgl-kernel (#5287)	2025-04-11 01:58:46 -07:00
Yineng Zhang	7074e9ca20	fix: enable fp4 compilation on cu128 (#5286)	2025-04-11 01:43:44 -07:00
Michael Yao	fc14cca088	Fix a 404 link in send_request.ipynb (#5280) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>	2025-04-11 01:38:45 -07:00
XinyuanTong	e7beff8a13	fix: examples for token_in_token_out_vlm (#5193)	2025-04-11 01:38:23 -07:00
mlmz	4d2e305149	doc: nested loop code for offline engine (#5244)	2025-04-11 01:36:30 -07:00
Mick	e53a0b3d5b	[fix] fix mrope positions not picked up (#5265)	2025-04-11 01:29:45 -07:00
Cheng Wan	038bc5d521	Support `--enable-llama4-multimodal` (#5254)	2025-04-11 01:24:14 -07:00
Chang Su	aee62d744b	Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262) Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-04-11 00:34:17 -07:00
fzyzcjy	cd7e32e2cb	Optimize attention in llama4 (#5127)	2025-04-11 00:32:41 -07:00
HAI	8879944800	ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (#5228)	2025-04-10 18:19:57 -07:00
Richard Zou	a879811c4b	Fix torch.compile cacheing (#5259) Co-authored-by: zhyncs <me@zhyncs.com>	2025-04-10 18:08:45 -07:00
Elfie Guo	a222945df2	Update Makefile / build script to avoid installing incompatible torch dependency (#5245)	2025-04-10 22:21:02 +00:00
PGFLMG	ed01b4515e	[Misc] Clean sgl-kernel test (#5216)	2025-04-10 11:28:41 -07:00
HAI	d050df368c	ROCm sgl-kernel: compatible to later torch (#5167)	2025-04-10 09:18:36 -07:00
Richard Zou	76f44c2a8d	Fix deepseek-v3 with torch.compile in PyTorch 2.6. (#5213)	2025-04-10 09:14:38 -07:00
Ke Bao	1078396f47	Update deps for mllama4 (#5215)	2025-04-10 09:12:44 -07:00
Teng Ma	7e4f72dd8c	[PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (#5204)	2025-04-10 20:05:34 +08:00
Teng Ma	4c31ae9f6d	[PD] Support KV transfer with mooncake (#4880) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com> Co-authored-by: shangmingc <csmthu@gmail.com>	2025-04-10 14:23:23 +08:00
Xiaoyu Zhang	f730362ee2	reduce moe_align_block_size_kernel small batch mode overhead (#5086)	2025-04-09 17:59:35 -07:00
fzyzcjy	e3c4bd3153	Fix DeepSeek error when using DeepEP mode (#5190)	2025-04-09 17:43:22 -07:00
Stefan He	5db37c8626	[metrics] Add in queue metrics (#4444)	2025-04-09 17:19:27 -07:00
Yineng Zhang	4cb53ecd0c	fix: log warning when disable cuda graph (#5209)	2025-04-09 14:16:13 -07:00
Zhaoyang Hao	456b008bd8	Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (#5196)	2025-04-09 11:54:36 -07:00
Yi Zhang	ebf495f013	sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207)	2025-04-09 11:47:04 -07:00
saienduri	7f875f1293	update grok test (#5171)	2025-04-09 11:09:47 -07:00
Mick	fbebcb7aa4	model: support mllama4 (#5144)	2025-04-09 09:28:44 -07:00
Xiaoyu Zhang	87eddedfa2	[ci] fix ci test fused_moe op (#5102)	2025-04-09 08:52:46 -07:00
HandH1998	4065248214	Support Llama4 fp8 inference (#5194) Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: zhyncs <me@zhyncs.com>	2025-04-09 20:14:34 +08:00
fzyzcjy	86a876d883	Optimize topk operation in llama4 (#5128)	2025-04-09 02:50:22 -07:00
kk	92823069c4	Fix ci test "test_eval_fp8_accuracy" failed (#5185) Co-authored-by: wunhuang <wunhuang@amd.com>	2025-04-09 02:44:05 -07:00
yinfan98	d2e507df3c	[Misc] clean up vllm in sgl-kernel test (#5189)	2025-04-09 01:22:13 -07:00
fzyzcjy	61970b08d8	Let `bench_one_batch` support `enable_dp_attention` (#4058)	2025-04-08 23:44:25 -07:00
Cheng Wan	76c48a0913	[DeepEP] fix: import buffer error (#5179)	2025-04-08 22:12:14 -07:00
Yineng Zhang	90caf06c00	fix: use DeepEPDispatcher on CUDA (#5180)	2025-04-08 21:56:53 -07:00
Yineng Zhang	6669d12707	feat: add DeepGEMM build warning (#5176) Co-authored-by: grimoire <streetyao@live.com>	2025-04-08 21:16:23 -07:00
Kay Yan	f2b70afde0	docs: remove the use of Downward API for LWS_WORKER_INDEX (#5110) Signed-off-by: Kay Yan <kay.yan@daocloud.io>	2025-04-08 20:46:11 -07:00
Jinyan Chen	bc3f6db2dd	[Fix] DeepEP Compatibility with Low Latency (#5068) Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-04-08 20:31:31 -07:00
Chang Su	aac531c53b	[Bugfix] Fix index out of bounds in local attention with large sequences (#5173)	2025-04-08 18:43:13 -07:00
fzyzcjy	39efad4fbc	Tiny disable model that does not work (#5175)	2025-04-08 18:42:37 -07:00
fzyzcjy	466899e69c	Fix multimodal hashing error (#5174)	2025-04-08 18:42:26 -07:00

2775 Commits