2860 Commits

Author SHA1 Message Date
eigen
8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-16 19:58:20 -07:00
BearBiscuit
90faf9018e [verl] Modify the update_weights func to align with verl's resharding (#5345)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-04-16 19:56:57 -07:00
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Ying Sheng
d7bc19a46a add multi-lora feature in README.md (#5463) 2025-04-16 03:25:25 -07:00
Elfie Guo
85ec0440a5 Update cutlass dependency. (#5447) 2025-04-15 23:28:04 -07:00
Xiaoyu Zhang
06a1656e02 [doc] Update benchmark_and_profiling.md (#5449) 2025-04-15 23:27:34 -07:00
Cheng Wan
6aca583420 Fix several minor issues in PD disaggregation (#5444) 2025-04-15 23:04:41 -07:00
Yineng Zhang
5b5c7237c8 chore: bump v0.4.5.post1 (#5445) 2025-04-15 23:00:07 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
ybyang
dd83e7e9c3 [Bug fix] need record start time in pd mode (#5425) 2025-04-16 10:11:16 +08:00
Lianmin Zheng
0769b14bf9 [Minor] Move torch.compile patch to a better place (#5397) 2025-04-15 18:37:07 -07:00
Michael Yao
b64b88e738 [Docs] Update start/install.md (#5398) 2025-04-15 18:12:26 -07:00
ryang
bc24205b32 Support BNB quantization for llama/mllama (#5038)
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
2025-04-15 18:00:31 -07:00
mRSun15
3efc8e2d2a add attention backend supporting matrix in the doc (#5211)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-04-15 17:16:34 -07:00
Chang Su
27a009bb00 Fix ignore_eos parameter when loading a chat template (#5264) 2025-04-15 17:09:45 -07:00
Yineng Zhang
8ec0bb7d55 chore: upgrade sgl-kernel 0.0.9.post1 (#5436) 2025-04-15 15:45:51 -07:00
Yineng Zhang
fa909dc3c4 feat: update model_specific_adjustment (#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-15 14:45:15 -07:00
Trevor Morris
e8f62b20ca BLackwell cutlass mla: Add check for bad page size/block num combinations (#5431) 2025-04-15 14:07:42 -07:00
Yineng Zhang
88defc4d89 fix: solve release issue (#5434) 2025-04-15 12:58:11 -07:00
Yineng Zhang
6f509d5503 chore: bump sgl-kernel v0.0.9.post1 (#5430) 2025-04-15 11:00:21 -07:00
DefTruth
12ef7e3bc3 bugfix: fix merge_state_v2 cuda graph (#5419) 2025-04-15 10:18:47 -07:00
Lianmin Zheng
838fa0f218 [minor] cleanup cmakelists.txt (#5420) 2025-04-15 07:07:07 -07:00
shangmingc
f1b3b75fc6 [PD] Remove unused bootstrap param and fix port table type (#5423) 2025-04-15 21:21:20 +08:00
Liangsheng Yin
33b16ad178 Distinguish bootstrap key only in decode server (#5422) 2025-04-15 20:59:28 +08:00
shangmingc
ffde65a094 [PD] Fix dynamic port support and MLA buffer for Mooncake (#5415)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
2025-04-15 19:29:31 +08:00
lambert0312
471650dee0 Fix broadcast use cuda device lead to memory capacity unbalanced (#5416) 2025-04-15 02:47:26 -07:00
Yuan Luo
d06a83fb01 Support dynamic connection and TP 16 (#5351)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-04-15 17:08:07 +08:00
Zhaoyang Hao
5d13440162 [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (#5412) 2025-04-15 01:42:27 -07:00
JieXin Liang
f88f7e1943 [misc] fix ci flaky case (#5352) 2025-04-15 01:37:16 -07:00
Yuhong Guo
3dfc6023ce Fix bench_serving with random-ids (#5214) 2025-04-15 01:34:35 -07:00
fzyzcjy
15e91d721b Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (#5406) 2025-04-15 01:33:47 -07:00
Yineng Zhang
8aab7fdb21 chore: upgrade sgl-kernel 0.0.9 (#5401) 2025-04-14 22:37:59 -07:00
Yineng Zhang
e940dc4f06 chore: bump sgl-kernel 0.0.9 (#5400) 2025-04-14 21:34:04 -07:00
DefTruth
388e15c0db kernel: support slightly faster merge_state_v2 cuda kernel (#5381) 2025-04-14 21:28:23 -07:00
Yineng Zhang
11421a3f44 fix: update pr-test-sgl-kernel (#5399) 2025-04-14 21:14:59 -07:00
Yineng Zhang
6c41fcf0e4 chore: upgrade DeepGEMM (#5395) 2025-04-14 20:32:46 -07:00
Yangcheng Li
ee9d6ca677 [fix/misc] remove duplicate row in deepseek v2 model (#5279) 2025-04-14 18:41:24 -07:00
Ximingwang-09
2dd6489468 Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5291)
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
2025-04-14 18:40:31 -07:00
lambert0312
61e7c4dd21 Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5368) 2025-04-14 18:39:44 -07:00
Lianmin Zheng
dae7944440 minor clean up of sgl-kernel/CMakeLists.txt (#5393) 2025-04-14 18:38:44 -07:00
Baizhou Zhang
f6772f1497 [Fix] Turn off DeepGEMM by default (#5263) 2025-04-14 17:45:44 -07:00
Yineng Zhang
ac5b78baf6 fix: update test config (#5392) 2025-04-14 17:39:47 -07:00
Xiaoyu Zhang
38076dea84 apply fused moe gate in ds v3/r1 (#5371)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-14 16:24:26 -07:00
Ke Bao
5e0a9b0981 Apply deepseek cuda rope (#5385)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-14 15:22:43 -07:00
JieXin Liang
bdde237562 [perf] experimental enhance fp8 per-tensor quant (#5370) 2025-04-14 12:35:43 -07:00
ybyang
e9fc2ac7b6 [PD Bug] fix MLA get_contiguous_buf_infos error (#5384) 2025-04-14 22:56:39 +08:00
Liangsheng Yin
44afde82d7 Fix PD disaggregation bugs (#5326) 2025-04-14 19:27:30 +08:00
yhyang201
072df75354 Support for Qwen2.5-VL Model in bitsandbytes Format (#5003) 2025-04-14 02:03:40 -07:00
fzyzcjy
defede5073 Fix DeepSeek DP Attention + torch compile (#5367)
Co-authored-by: ispobock <ispobaoke@163.com>
2025-04-14 01:07:58 -07:00
Yongtong Wu
fc72871975 Free metadata_buffer_index after transfer finished (#5364) 2025-04-14 01:06:14 -07:00