2891 Commits

Author SHA1 Message Date
Yineng Zhang
0961feefca feat: use flashinfer jit package (#5547) 2025-04-19 00:28:39 -07:00
ybyang
59dd090f1c [PD] Fix no cache connect for receiver (#5534) 2025-04-19 14:55:28 +08:00
fzyzcjy
569b032c58 [PD] Tiny fix timeout error when generate (#5545) 2025-04-19 14:42:57 +08:00
fzyzcjy
f6a71139a8 Make profiler output file names consistent (#5548) 2025-04-18 22:57:11 -07:00
fzyzcjy
1e0806f30b Fix DeepGEMM masked cannot be run on groups not being multiple of 4 (#5340) 2025-04-18 22:38:07 -07:00
Yineng Zhang
2c11f9c2eb chore: upgrade sgl-kernel 0.0.9.post2 (#5540) 2025-04-18 21:17:23 -07:00
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
Yineng Zhang
08b518d51f fix util import (#5542) 2025-04-18 15:06:46 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 01:13:57 -07:00
liwenju0
e465b08ddb fix bug of VLLM_AVAILABLE not defined (#5497) 2025-04-18 00:59:03 -07:00
Xiaoyu Zhang
bed05878f6 fix kimi vl running bug after rebase main (#5461) 2025-04-18 00:17:34 -07:00
strgrb
b2a189dd11 use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel (#5473)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
2025-04-18 00:05:24 -07:00
Yineng Zhang
f28d82997a chore: bump sgl-kernel 0.0.9.post2 (#5518) 2025-04-17 23:42:39 -07:00
Xiaoyu Zhang
8e09b37077 Sgl kernel fused_moe_gate support n_shared_experts (#5440) 2025-04-17 23:05:15 -07:00
fzyzcjy
53dcf38876 Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (#4836) 2025-04-17 21:38:26 -07:00
Michael Feil
1effba4c70 Configuration qwen2_moe.py - qkv_bias now in transformers (#5512) 2025-04-17 21:23:22 -07:00
Michael Yao
a0fc5bc144 [docs] Fix several consistency issues in sampling_params.md (#5373)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 10:54:40 +08:00
mlmz
27e9538a7e Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (#5426)
Co-authored-by: ocss884 <ocss.lin@gmail.com>
2025-04-18 10:51:39 +08:00
u4lr451
211c7b31b8 Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511) 2025-04-17 18:49:59 -07:00
PGFLMG
c08a717c77 [Feat] Update sgl-kernel flashinfer to latest main version (#5500)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-04-17 12:43:23 -07:00
mlmz
f13d65a7ea Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (#5503) 2025-04-17 11:37:43 -07:00
Xuchun Shang
06d0a3d92b [Bug fix] use correct func path in deepseek (#5496)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-04-17 02:41:41 -07:00
Michael Yao
22c2a79dc5 Fix a link in sgl-kernel/README.md (#5493) 2025-04-17 02:25:28 -07:00
fzyzcjy
8beb356f0d Refactor DeepSeek decoder layer branches (#5205) 2025-04-17 02:11:11 -07:00
Chang Su
c776234b45 Enable local attention during decode (#5479) 2025-04-17 02:07:43 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Baizhou Zhang
6fb29ffd9e Deprecate enable-flashinfer-mla and enable-flashmla (#5480) 2025-04-17 01:43:33 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
Baizhou Zhang
81c891111f Add test for flash_attn_varlen_func kernel (#5484) 2025-04-17 01:42:56 -07:00
Didier Durand
92d1561b70 Update attention_backend.md: plural form (#5489) 2025-04-17 01:42:40 -07:00
eigen
8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-16 19:58:20 -07:00
BearBiscuit
90faf9018e [verl] Modify the update_weights func to align with verl's resharding (#5345)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-04-16 19:56:57 -07:00
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Ying Sheng
d7bc19a46a add multi-lora feature in README.md (#5463) 2025-04-16 03:25:25 -07:00
Elfie Guo
85ec0440a5 Update cutlass dependency. (#5447) 2025-04-15 23:28:04 -07:00
Xiaoyu Zhang
06a1656e02 [doc] Update benchmark_and_profiling.md (#5449) 2025-04-15 23:27:34 -07:00
Cheng Wan
6aca583420 Fix several minor issues in PD disaggregation (#5444) 2025-04-15 23:04:41 -07:00
Yineng Zhang
5b5c7237c8 chore: bump v0.4.5.post1 (#5445) 2025-04-15 23:00:07 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
ybyang
dd83e7e9c3 [Bug fix] need record start time in pd mode (#5425) 2025-04-16 10:11:16 +08:00
Lianmin Zheng
0769b14bf9 [Minor] Move torch.compile patch to a better place (#5397) 2025-04-15 18:37:07 -07:00
Michael Yao
b64b88e738 [Docs] Update start/install.md (#5398) 2025-04-15 18:12:26 -07:00
ryang
bc24205b32 Support BNB quantization for llama/mllama (#5038)
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
2025-04-15 18:00:31 -07:00
mRSun15
3efc8e2d2a add attention backend supporting matrix in the doc (#5211)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-04-15 17:16:34 -07:00
Chang Su
27a009bb00 Fix ignore_eos parameter when loading a chat template (#5264) 2025-04-15 17:09:45 -07:00
Yineng Zhang
8ec0bb7d55 chore: upgrade sgl-kernel 0.0.9.post1 (#5436) 2025-04-15 15:45:51 -07:00
Yineng Zhang
fa909dc3c4 feat: update model_specific_adjustment (#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-15 14:45:15 -07:00
Trevor Morris
e8f62b20ca Blackwell cutlass mla: Add check for bad page size/block num combinations (#5431) 2025-04-15 14:07:42 -07:00
Yineng Zhang
88defc4d89 fix: solve release issue (#5434) 2025-04-15 12:58:11 -07:00