Commit Graph

889 Commits

Author SHA1 Message Date
Qiaolin Yu
69f453e5a4 Use device_group for all_gather when disabling overlap scheduling (#8001) 2025-07-15 19:38:58 -07:00
Qiaolin Yu
3bc43c683e Fix different device type adjustment in PP (#7760) 2025-07-15 19:37:14 -07:00
Xinyuan Tong
6e923dbd30 feat: update multimodal data handling in engine entrypoint (#8002) 2025-07-15 00:12:22 -07:00
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
杨睿
9b560c3e1c fix: modality length mismatch with image_data (#7887) 2025-07-15 14:27:54 +08:00
ehuaa
64e78bb31b prevent server crash from potential invalid grammar (#7897) 2025-07-15 11:21:45 +08:00
Lifu Huang
d969504d9a Fix flaky CI: test_vlm_models (#8006) 2025-07-14 14:56:41 -07:00
Hanming Lu
9379da77de SWA Prefix Cache (#7367) 2025-07-13 12:31:07 -07:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
ehuaa
0c55cbcfc5 [BugFix] add verify logit_bias to avoid crash because of IndexError (#7749) 2025-07-14 02:44:12 +08:00
ronnie_zheng
86044712c6 [feature] kv transfer support of ascend npu (#7795) 2025-07-11 00:07:51 -07:00
Co-authored-by: liupeng <liupeng374@huawei.com>
Mick
b5e3d6031c vlm: support video as an input modality (#5888) 2025-07-09 23:48:35 -07:00
kyleliang-nv
dd445a41f5 [feature] Add start step profile argument in /start_profile (#7608) 2025-07-09 18:42:15 -07:00
Xinyuan Tong
136c6e0431 fix: Handles input_embeds in GenerateReqInput when n>1 (#7830) 2025-07-08 14:00:42 -07:00
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Xinyuan Tong
4bab50a6b5 Fix llama4 vision (#7840) 2025-07-08 14:00:03 -07:00
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Ziming Huang
9abe1163ac fix duplicate args in schedule_batch (#7816) 2025-07-07 01:31:03 -07:00
Zhiqiang Xie
2fc824b84c Kernels for efficient KV cache IO (#7313) 2025-07-06 22:53:36 -07:00
Yuan Luo
253454de9b Integrate triton moe kernel (#7689) 2025-07-06 20:05:49 -07:00
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
yuhsuan-t
8d4a01cbd7 Log the timestamps of each prefill/decode iteration (#6094) 2025-07-07 01:57:27 +00:00
Co-authored-by: yuhsuan-t <12108766+yuhsaun-t@users.noreply.github.com>
Nan Jiang
ba69c153f6 [RL]: Fix error tagging in multi-stage wake up (#7812) 2025-07-06 16:51:29 -07:00
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Stefan He
3589aa79b0 [RL] Fix illegal memory for _import_static_state (#7733) 2025-07-06 16:25:21 -07:00
Co-authored-by: nanjiangwill <willjiang2018@gmail.com>
Cheng Wan
8fc910db03 DP Attention with Auto DeepEP Dispatch (#7222) 2025-07-05 01:54:24 -07:00
Leng Yue
8364608930 add model: qwen2-audio (#7596) 2025-07-04 21:13:10 -07:00
Zilin Zhu
af46f299f9 [RL] add pause and continue generation for async rl training (#7419) 2025-07-04 18:49:49 -07:00
Lianmin Zheng
14229ccf8f Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748) 2025-07-04 16:33:33 -07:00
TianyuZhang1214
0099172327 feat: use D2D instead of H2H in pp (#7673) 2025-07-03 10:58:50 -07:00
Co-authored-by: alpha-baby <fujianhao1997@qq.com>
Chunyuan WU
1dce6c480f [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (#6771) 2025-07-03 09:51:38 -07:00
ronnie_zheng
1e0e549766 Ascend attention backend(PA&MLA) (#7722) 2025-07-03 09:23:19 -07:00
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
Ziming Huang
1bebd3154e Fix num_tokens_pre_allocated in disaggregation log (#7714) 2025-07-02 22:31:49 -07:00
Albert
d3c275b117 Support updating weights at once by stopping all requests (#6698) 2025-07-02 22:26:06 -07:00
Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>
Xiaoyu Zhang
8e64140e35 [b200] support trt-llm allreduce fuse rms_norm_add kernel (#7621) 2025-07-02 19:36:20 -07:00
Zilin Zhu
0626f678de [RL] support update_weights_from_distributed with different group and multiple weights (#7292) 2025-07-02 19:29:11 -07:00
Lifu Huang
1a08358aed Improve error handling for requests with unloaded LoRA path(s) (#7642) 2025-07-01 20:05:34 -07:00
Xinyuan Tong
3a911b854d Refactor mm processors and Enable mixed modality processing (#7629) 2025-06-30 23:14:48 -07:00
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Lianmin Zheng
22352d47a9 Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632) 2025-06-29 23:16:19 -07:00
Co-authored-by: Kan Wu <wukanustc@gmail.com>
fzyzcjy
0c9c6c75a8 Move files related to EPLB (#7580) 2025-06-29 15:39:38 -07:00
Lianmin Zheng
071a1f51ae [Minor] clean up multimodal processor and tokenizer manager (#7624) 2025-06-29 02:50:14 -07:00
Lifu Huang
49538d111b Support dynamic LoRA loading / unloading in engine/server API (#7446) 2025-06-27 21:00:27 -07:00
tarinkk
eb6c2c1663 Hybrid kv cache for LLaMA4 (#6563) 2025-06-27 18:58:55 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Lianmin Zheng
ce3a3e8783 Move multimodal processors into a separate folder (#7581) 2025-06-27 11:58:24 -07:00
Cheng Wan
1b8cf77b01 [Fix] incorrect assert in EPLB (#7575) 2025-06-26 14:59:20 -07:00
Xinyuan Tong
9b00990bea chore: remove vlm unnecessary import (#7541) 2025-06-26 01:38:15 -07:00
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Stefan He
00fbd8a484 Fix typo of flash_cache (#7513) 2025-06-25 02:04:41 -07:00
zixuanzhang226
f3cbd24541 feat: send kvmetrics from sglang scheduler (#6721) 2025-06-25 01:57:49 -07:00
DangKai
bc2e5645c4 fix: force synchronization between TP workers when update_weights (#6626) 2025-06-25 01:35:59 -07:00
Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>
yilian49
587b4c6e92 EPLB support for MTP (#7510) 2025-06-25 01:16:56 -07:00
ybyang
03c039c48e [OAI] patch origin request_id logic (#7508) 2025-06-24 20:09:38 -07:00
u4lr451
ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269) 2025-06-23 17:34:13 -07:00
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
Yuhong Guo
e5afb88b1c Support weight loading without mmap (#7469) 2025-06-23 15:13:59 -07:00
Liangsheng Yin
25549433e8 Fix prefill OOM due to wrong token calculation when page > 1 (#7397) 2025-06-24 02:12:29 +08:00
Lianmin Zheng
55e03b10c4 Fix a bug in BatchTokenIDOut & Misc style and dependency updates (#7457) 2025-06-23 06:20:39 -07:00
Trevor Morris
5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) 2025-06-22 13:38:47 -07:00
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>