Commit Graph

1941 Commits

Author SHA1 Message Date
Sundara Raman Ramachandran
f08154193c Perform Batch Tokenization. (#5141) 2025-04-20 18:10:37 -07:00
fzyzcjy
5fc4b6004e Add sanity check for max_running_requests (#5016) 2025-04-20 17:56:49 -07:00
Brayden Zhong
b868526d94 Fix one more issue reported by torchfix (#4859) 2025-04-20 17:49:27 -07:00
Juwan Yoo
502524e2da compressed_tensors: port w8a16 fp8 from vllm (#4852) 2025-04-20 17:48:31 -07:00
Enrique Shockwave
4c7640079c check marlin format before attempting conversion (#4675) 2025-04-20 17:47:09 -07:00
kyle-pena-kuzco
9f3bd2ad39 Feat: Implement JSON Mode (response_format.type="json_object") (#4733)
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
2025-04-20 17:41:22 -07:00
Yi Zhou
fac17acf08 add function call parser for DeepSeek V3 (#5224) 2025-04-20 17:38:08 -07:00
Adarsh Shirawalmath
8b39274e34 [Feature] Prefill assistant response - add continue_final_message parameter (#4226)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-04-20 17:37:18 -07:00
Byron Hsu
c951d312ed [PD] Fix large page size + chunk prefill (#5588) 2025-04-20 17:21:54 -07:00
AmadeusW
dcb8232596 Fix ChatCompletionMessageGenericParam to allow for None content (#5452) 2025-04-20 17:15:38 -07:00
Yineng Zhang
66c0ff9e31 fix: use fa3 for gemma2 (#5586) 2025-04-20 17:02:09 -07:00
tarinkk
9a7e83e899 Fix enable chunked prefill for Llama4 (#5575) 2025-04-20 17:01:30 -07:00
lukec
417b44eba8 [Feat] upgrade pytorch2.6 (#5417) 2025-04-20 16:06:34 -07:00
fzyzcjy
475e2e378a [PD] Fix server crash when using batch requests (#5531) 2025-04-20 16:02:23 -07:00
fzyzcjy
fba86b6b54 Tiny improve error message (#5526) 2025-04-20 16:00:15 -07:00
fzyzcjy
fa2f677e18 Fix torch memory saver not enabled in DP scenario (#5560) 2025-04-20 14:20:52 -07:00
fzyzcjy
463d4b7400 Fix DeepEP cannot run on latest master (#5567) 2025-04-20 14:19:42 -07:00
fzyzcjy
9924bbe153 Fix bench_serving fail when zero warmup requests (#5574) 2025-04-20 14:16:03 -07:00
Lianmin Zheng
fbdc94ba59 Release v0.4.5.post2 (#5582) 2025-04-20 14:12:37 -07:00
Baizhou Zhang
b54b5a96e4 [Doc]Add instruction for profiling with bench_one_batch (#5581) 2025-04-20 14:05:36 -07:00
JieXin Liang
bca832c7c6 [Fix] fix outlines and xgrammar (#4947) 2025-04-20 13:31:25 -07:00
Xiaoyu Zhang
d9dd529854 enable DeepSeek V3 shared_experts_fusion in sm90 (#5571) 2025-04-20 12:46:42 -07:00
fzyzcjy
0a0dd34e6a Fix BumpAllocator error when no input_ids (#5564) 2025-04-20 02:20:53 -07:00
fzyzcjy
80ac527d22 [PD] Fix DeepSeek cannot be run on latest master (#5568) 2025-04-20 02:19:48 -07:00
JieXin Liang
99456bcacb [perf] introduce deep gemm group_gemm_masked as bmm (#5432) 2025-04-20 00:38:27 -07:00
fzyzcjy
d07e797ace Fix bench_one_batch producing unnatural results for expert parallel (#5149) 2025-04-20 00:38:04 -07:00
Zhiqiang Xie
e2574ee986 fix hicache write back (#5543) 2025-04-19 21:56:22 -07:00
Byron Hsu
ab4b5606e4 [PD] Support page size > 1 (#5561) 2025-04-19 21:54:27 -07:00
Yubo Wang
20f1c8e374 Fix sampler nan check when calling top_k_top_p_sampling_from_probs (#5546) 2025-04-19 21:47:23 -07:00
fzyzcjy
613b197e57 Remove one kernel in per_tensor_quant_mla_fp8 (#5549) 2025-04-19 15:08:15 -07:00
Xiaoyu Zhang
d58e354472 simplify the control logic for using shared experts fusion (#5504) 2025-04-19 13:17:35 -07:00
Xiaoyu Zhang
bf86c5e990 restruct compressed_tensors_w8a8_fp8 (#5475) 2025-04-19 04:52:15 -07:00
shangmingc
dca90f1db8 [PD] Remove the requirement of config file for mooncake backend (#5460) 2025-04-19 19:31:00 +08:00
ybyang
59dd090f1c [PD] Fix no cache connect for receiver (#5534) 2025-04-19 14:55:28 +08:00
fzyzcjy
569b032c58 [PD] Tiny fix timeout error when generate (#5545) 2025-04-19 14:42:57 +08:00
fzyzcjy
f6a71139a8 Make profiler output file names consistent (#5548) 2025-04-18 22:57:11 -07:00
fzyzcjy
1e0806f30b Fix DeepGEMM masked cannot be run on groups not being multiple of 4 (#5340) 2025-04-18 22:38:07 -07:00
Yineng Zhang
2c11f9c2eb chore: upgrade sgl-kernel 0.0.9.post2 (#5540) 2025-04-18 21:17:23 -07:00
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
Yineng Zhang
08b518d51f fix util import (#5542) 2025-04-18 15:06:46 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 01:13:57 -07:00
liwenju0
e465b08ddb fix bug of VLLM_AVAILABLE not defined (#5497) 2025-04-18 00:59:03 -07:00
Xiaoyu Zhang
bed05878f6 fix kimi vl running bug after rebase main (#5461) 2025-04-18 00:17:34 -07:00
strgrb
b2a189dd11 use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel (#5473)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
2025-04-18 00:05:24 -07:00
fzyzcjy
53dcf38876 Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (#4836) 2025-04-17 21:38:26 -07:00
Michael Feil
1effba4c70 Configuration qwen2_moe.py - qkv_bias now in transformers (#5512) 2025-04-17 21:23:22 -07:00
mlmz
27e9538a7e Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (#5426)
Co-authored-by: ocss884 <ocss.lin@gmail.com>
2025-04-18 10:51:39 +08:00
u4lr451
211c7b31b8 Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511) 2025-04-17 18:49:59 -07:00
Xuchun Shang
06d0a3d92b [Bug fix] use correct func path in deepseek (#5496)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-04-17 02:41:41 -07:00