sglang

Author	SHA1	Message	Date
fzyzcjy	d6e1d28c8a	Refactor DeepSeek attention dispatching (#6476)	2025-05-21 02:03:39 -07:00
Zilin Zhu	7c347259ff	[RL] allow weight updation with dp attention enabled (#6311)	2025-05-21 01:58:55 -07:00
Zilin Zhu	669caa0a3f	[router] support http2 in router (#6487)	2025-05-21 01:42:45 -07:00
Jiajun Li	4024e1d2a8	Implement Siglip Vision model, and support BNB quantization for gemma3-mm (#5339)	2025-05-20 23:53:46 -07:00
HAI	5c0b38f369	aiter attention-backend (default enabled on AMD/ROCm) (#6381)	2025-05-20 22:52:41 -07:00
Yuan Luo	30ca18f423	Refactor group_concurrent_contiguous in NIXL (#6214) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-05-21 11:55:04 +08:00
Lianmin Zheng	03886917bd	Disable all two stream overlap on amd (#6475)	2025-05-20 19:06:59 -07:00
Wenxuan Tan	66324895c6	[docs] Fix torch version (#6472)	2025-05-20 10:53:14 -07:00
fzyzcjy	13feffd082	Fix master CI for DeepSeek (#6447)	2025-05-20 00:31:42 -07:00
fzyzcjy	e98afbe042	Support dispatching logical to physical experts (#6385)	2025-05-19 22:13:55 -07:00
JieXin Liang	69af3ec35f	[doc] add note for get_num_kv_splits in triton_backend (#6444)	2025-05-19 21:40:21 -07:00
YanbingJiang	32cc66efa5	Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (#6405) Co-authored-by: mingfeima <mingfei.ma@intel.com>	2025-05-19 21:23:17 -07:00
PGFLMG	83f2d9d4ed	[QuickFix] fix gptq model initialize (#6429)	2025-05-19 21:17:10 -07:00
HAI	6317c5c61f	Address performance regression: disable multiple streams on ROCm (#6412)	2025-05-19 21:16:20 -07:00
fzyzcjy	cba1cdbc46	Support DeepSeek EPLB algorithm with static distributions (#6387)	2025-05-19 21:06:21 -07:00
fzyzcjy	c471d39eb9	Support loading weights when physical experts are different from logical experts (#6386)	2025-05-19 21:05:53 -07:00
fzyzcjy	d0443275f0	Refactor DeepSeek logic into atomic operations (#6326)	2025-05-19 21:05:30 -07:00
Liangsheng Yin	17d080b7ae	Remove `Cargo.lock`, add it into .gitignore (#6438)	2025-05-20 12:01:32 +08:00
fzyzcjy	1b19df4b2a	Refactor communication logic of DeepSeek for extensibility and understandability (#6321)	2025-05-19 20:14:48 -07:00
fzyzcjy	f0653886a5	Expert distribution recording without overhead for EPLB (#4957)	2025-05-19 20:07:43 -07:00
Yineng Zhang	b146555749	Revert "Implement `return_hidden_states` for the OpenAI API (#6137)" (#6440)	2025-05-19 18:21:29 -07:00
Yi Zhang	b06215daed	[BUG] fix stop_profile crash (#6431)	2025-05-19 17:30:33 -07:00
Trevor Morris	7adf245ba2	[Metrics] Add KV events publishing (#6098)	2025-05-19 14:19:54 -07:00
Baizhou Zhang	299fd22f9e	Fix throughput threshold for amd ci test (#6414)	2025-05-19 14:17:41 -07:00
simveit	506e5de8fe	Improve supported models doc (#6430)	2025-05-20 01:43:35 +08:00
lukec	844e2f227a	Fix nodeepgemm init (#6417)	2025-05-19 00:44:03 -07:00
kyle-pena-kuzco	4f39bcf7ab	Implement `return_hidden_states` for the OpenAI API (#6137)	2025-05-18 22:30:25 -07:00
fzyzcjy	31c9569bb8	Fix request id error (#6401)	2025-05-18 18:58:59 -07:00
Chang Su	1be6956d1b	[Bugfix] Fix field error in v1_embedding_request (#6400)	2025-05-18 15:58:29 -07:00
Mick	626ccb7d3f	vlm: tensor hash kernel (#5974)	2025-05-18 15:38:16 -07:00
fzyzcjy	72bfb0baf0	Refactor DeepSeek MoE layer to unify the two forward branches (#6325)	2025-05-18 15:34:36 -07:00
wangxiyu191	155214952b	refactor: Extract repeated member variables in KVCache subclasses to base class. (#6323)	2025-05-18 15:28:15 -07:00
Chang Su	ebe58d545d	[Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py (#6333)	2025-05-18 15:27:13 -07:00
Chang Su	066cf44546	[OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397)	2025-05-18 13:05:38 -07:00
applesaucethebun	6dc6b30637	Add missing model to doc (#6396) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2025-05-18 12:57:58 -07:00
JieXin Liang	1f30c05d4a	[fix] fix fa3 forward_decode with spec_decode (#6395) Co-authored-by: Stefan He <hebiaobuaa@gmail.com>	2025-05-18 12:50:15 -07:00
Chunyuan WU	5dd62c3a6f	Add fp8 shared_expert kernel for CPU in sgl-kernel and add UT (#6339) Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com> Co-authored-by: mingfeima <mingfei.ma@intel.com>	2025-05-18 12:42:15 -07:00
fzyzcjy	f11481b921	Add 4-GPU runner tests and split existing tests (#6383)	2025-05-18 11:56:51 -07:00
doujiang24	9d24c3ffb0	chore: tiny remove duplicated code (#6392) Signed-off-by: doujiang24 <doujiang24@gmail.com>	2025-05-18 02:17:32 -07:00
Yury Sulsky	24161c5913	The Gemma template is missing a newline after the user role. (#6331) Co-authored-by: Yury Sulsky <ysulsky@tesla.com>	2025-05-18 01:57:27 -07:00
Yineng Zhang	eabcf82acb	feat: add long context example (#6391)	2025-05-18 01:45:17 -07:00
Sai Enduri	c47a51db7e	Clean up AMD CI (#6365)	2025-05-18 01:17:28 -07:00
libra	11553c1a37	Add pipeline parallelism for Qwen2 and Qwen3 Model (#6250)	2025-05-18 00:42:55 -07:00
Mick	01dd39bac1	refactor: minor refactors regarding multimodal processing (#6187)	2025-05-17 22:53:20 -07:00
Lianmin Zheng	b3f3d610fd	Do not use FA3 for mistral (#6379)	2025-05-17 19:47:34 -07:00
Yineng Zhang	f07c6a009b	chore: upgrade sgl-kernel v0.1.3 (#6377)	2025-05-17 19:47:05 -07:00
Lianmin Zheng	4bb816d444	Fix CI tests (#6362)	2025-05-17 19:16:45 -07:00
ybyang	c250939ecb	[Fix Chat API] add request id for chat/completion for tracing (#6364)	2025-05-17 18:58:22 -07:00
ishandhanani	b6909aa223	fix: allow `launch_dummy_health_check_server` to start inside of running asyncio loop (#6330)	2025-05-17 18:32:41 -07:00
fzyzcjy	f87283573e	Add expert distribution APIs for engine (#6290)	2025-05-17 18:31:51 -07:00

3344 Commits