Commit Graph

654 Commits

Author
SHA1 Message Date
Ravi Theja
c6a0cacc35 Update CI tests for Llama4 models (#6421) 2025-06-01 11:52:15 +08:00
Lianmin Zheng
2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) 2025-05-31 15:53:55 -07:00
Chang Su
e39bca0756 ci: relax test_function_call_required (#6786) 2025-05-30 19:18:42 -07:00
Jianan Ji
a2bb856543 Temporarily lower mmlu threshold for triton sliding window backend (#6785) 2025-05-30 18:40:50 -07:00
Chang Su
f18b068f15 feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream (#6784) 2025-05-30 17:05:17 -07:00
Chao Yang
4fac524b14 update llama4 chat template and pythonic parser (#6679)
Co-authored-by: Chang Su <chang.s.su@oracle.com>
2025-05-30 17:01:22 -07:00
Jianan Ji
22630ca242 Support sliding window in triton backend (#6509) 2025-05-30 01:11:53 -07:00
Chang Su
c673727e0e refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor parse_streaming_increment (#6715) 2025-05-29 00:08:45 -07:00
iLeGend
e06b076105 Fix PP for Qwen3 MoE (#6709) 2025-05-28 23:06:18 -07:00
shangmingc
d63e76f735 [CI] Fix setup of disaggregation with different tp (#6706)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-28 11:17:27 -07:00
shangmingc
c25231c679 [CI] Fix flaky pp single node test (#6689)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-28 00:40:26 -07:00
Sai Enduri
f4a8987f69 Update amd docker and nightly models. (#6687) 2025-05-28 00:08:08 -07:00
Chang Su
41ba767f0c feat: Add warnings for invalid tool_choice and UTs (#6582) 2025-05-27 16:53:19 -07:00
Ke Bao
f127355a30 Add batch test for draft extend (#6672) 2025-05-27 16:32:05 -07:00
fzyzcjy
87068b5cc7 Support gathering expert distribution details (#6665) 2025-05-27 15:32:59 -07:00
Junrong Lin
2103b80607 [CI] update verlengine ci to 4-gpu test (#6007) 2025-05-27 14:32:23 -07:00
Sai Enduri
eb8f02dd87 Update nightly thresholds and dependencies. (#6635) 2025-05-26 11:44:13 -07:00
Lifu Huang
0d503090aa Supported precomputed feature for Kimi VL (#6599) 2025-05-26 01:24:13 -07:00
Yi Zhang
f9bab3d591 qwen3moe support two batch overlap (#6598) 2025-05-25 23:08:16 -07:00
Chang Su
16f69b1f65 feat: Improve Mistral and Qwen25 function call parsing (#6597) 2025-05-25 23:07:23 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
Yineng Zhang
7eb9d8e594 chore: upgrade transformers 4.52.3 (#6575)
Co-authored-by: Mick <mickjagger19@icloud.com>
2025-05-25 22:49:58 -07:00
fzyzcjy
a191a0e47c Improve performance of two batch overlap in some imbalanced cases (#6593) 2025-05-25 22:36:18 -07:00
Shenggui Li
3f23d8cdf1 added support for tied weights in qwen pipeline parallelism (#6546) 2025-05-25 00:00:56 -07:00
Lifu Huang
022012aae8 Support Phi-4 Multi-Modal (text + vision only) (#6494) 2025-05-24 21:43:38 -07:00
Xinyuan Tong
681fdc264b Refactor vlm embedding routine to use precomputed feature (#6543)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-24 18:39:21 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
kk
7a5e6ce1cb Fix GPU OOM (#6564)
Co-authored-by: michael <michael.zhang@amd.com>
2025-05-24 16:38:39 -07:00
Byron Hsu
2d831c6ef9 [PD] Support structured output (#6560) 2025-05-23 21:49:00 -07:00
Chang Su
ed0c3035cd feat(Tool Calling): Support required and specific function mode (#6550) 2025-05-23 21:00:37 -07:00
Shi Shuai
9c574585b3 fix: remove content=none test when tool called (#6347) 2025-05-23 15:12:55 -07:00
Byron Hsu
8233cc10fd [PD] Support logprob & Add failure test (#6558) 2025-05-23 14:29:20 -07:00
Byron Hsu
d2e0881a34 [PD] support spec decode (#6507)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-23 12:03:05 -07:00
YanbingJiang
d8189660a9 Update sgl-kernel UTs for activation/topk/norm/rope kernels (#6452) 2025-05-23 02:03:15 -07:00
Chunyuan WU
3ded6235c9 Add fp8 fused_experts kernel for CPU in sgl-kernel and add UT (#6404) 2025-05-23 02:01:55 -07:00
blzheng
4ba1eea83f Add fp8 qkv_proj_with_rope kernel for CPU in sgl-kernel and add UT (#6493) 2025-05-23 00:14:46 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-22 20:32:41 -07:00
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
Yineng Zhang
0b07c4a99f chore: upgrade sgl-kernel v0.1.4 (#6532) 2025-05-22 13:28:16 -07:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Ke Bao
6ce0ed073b Apply constraint grammar to EAGLE (#6499)
Co-authored-by: merrymercy <lianminzheng@gmail.com>
2025-05-21 17:18:41 -07:00
blzheng
cfe48c5902 [CPU] Fix build issue (#6419) 2025-05-21 11:17:10 -07:00
Jiajun Li
4024e1d2a8 Implement Siglip Vision model, and support BNB quantization for gemma3-mm (#5339) 2025-05-20 23:53:46 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
YanbingJiang
32cc66efa5 Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (#6405)
Co-authored-by: mingfeima <mingfei.ma@intel.com>
2025-05-19 21:23:17 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
Yineng Zhang
b146555749 Revert "Implement return_hidden_states for the OpenAI API (#6137)" (#6440) 2025-05-19 18:21:29 -07:00
Trevor Morris
7adf245ba2 [Metrics] Add KV events publishing (#6098) 2025-05-19 14:19:54 -07:00
Baizhou Zhang
299fd22f9e Fix throughput threshold for amd ci test (#6414) 2025-05-19 14:17:41 -07:00