Huapeng Zhou
|
2f7420bc84
|
[Feat] Enable PDL automatically on Hopper architecture (#5981)
|
2025-06-01 12:30:17 -07:00 |
|
Ravi Theja
|
c6a0cacc35
|
Update CI tests for Llama4 models (#6421)
|
2025-06-01 11:52:15 +08:00 |
|
Lifu Huang
|
0a9bfc20ab
|
[Minor] Always append newline after image token when parsing chat message (#6797)
|
2025-05-31 20:50:33 -07:00 |
|
Yineng Zhang
|
34c63731fc
|
chore: upgrade sgl-kernel v0.1.5 (#6795)
|
2025-05-31 18:32:00 -07:00 |
|
Lianmin Zheng
|
2d72fc47cf
|
Improve profiler and integrate profiler in bench_one_batch_server (#6787)
|
2025-05-31 15:53:55 -07:00 |
|
Yineng Zhang
|
b520d02888
|
chore: bump sgl-kernel v0.1.5 (#6794)
|
2025-05-31 14:54:00 -07:00 |
|
Qiaolin Yu
|
7dc0e39442
|
Bump torch to 2.7.0 (#6788)
|
2025-05-31 14:43:12 -07:00 |
|
Yikai Zhang
|
fb507b7b10
|
[FIX] mmmu bench serving result display error (#6525) (#6791)
|
2025-05-31 13:48:06 -07:00 |
|
storyicon
|
f90945c45a
|
fix(PD-disaggregation): Can not get local ip (#6792)
Signed-off-by: storyicon <storyicon@foxmail.com>
|
2025-05-31 13:47:14 -07:00 |
|
Lifu Huang
|
094fbdacd5
|
Fix incorrect LoRA weight loading for fused gate_up_proj (#6734)
|
2025-05-31 13:41:44 -07:00 |
|
YanbingJiang
|
888cb175a6
|
Add intel_amx backend for Radix Attention for CPU (#6408)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
|
2025-05-30 21:37:42 -07:00 |
|
Chang Su
|
e39bca0756
|
ci: relax test_function_call_required (#6786)
|
2025-05-30 19:18:42 -07:00 |
|
Jianan Ji
|
a2bb856543
|
Temporarily lower mmlu threshold for triton sliding window backend (#6785)
|
2025-05-30 18:40:50 -07:00 |
|
Cheng Wan
|
ced3c07afe
|
Support token-level quantization for EP MoE (#6782)
|
2025-05-30 17:26:30 -07:00 |
|
Chang Su
|
f18b068f15
|
feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream (#6784)
|
2025-05-30 17:05:17 -07:00 |
|
Chao Yang
|
4fac524b14
|
update llama4 chat template and pythonic parser (#6679)
Co-authored-by: Chang Su <chang.s.su@oracle.com>
|
2025-05-30 17:01:22 -07:00 |
|
Cheng Wan
|
b581b22504
|
Fix one bug in the grouped-gemm triton kernel (#6772)
|
2025-05-30 01:42:08 -07:00 |
|
Li Hui
|
69dd878b51
|
Fix shared experts fusion error (#6289)
|
2025-05-30 01:16:11 -07:00 |
|
Jianan Ji
|
22630ca242
|
Support sliding window in triton backend (#6509)
|
2025-05-30 01:11:53 -07:00 |
|
Yuhong Guo
|
d279d4990c
|
Fix aiohttp 'Chunk too big' in bench_serving (#6737)
|
2025-05-30 00:50:36 -07:00 |
|
shangmingc
|
6cb00c6398
|
[PD] Optimize time out logic and add env var doc for mooncake (#6761)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-30 00:45:02 -07:00 |
|
Xu Wenqing
|
62cac2c43a
|
Update DeepSeek-R1-0528 function call chat template (#6765)
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
|
2025-05-30 00:42:57 -07:00 |
|
fzyzcjy
|
2c3b71d678
|
Improve EPLB logical to physical dispatch map (#6727)
|
2025-05-29 19:23:54 -07:00 |
|
Zilin Zhu
|
51cdd81f97
|
[fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265)
|
2025-05-29 16:28:10 -07:00 |
|
Baizhou Zhang
|
73def253b5
|
Fix mem_fraction_static for AMD CI (#6748)
|
2025-05-29 12:37:30 -07:00 |
|
JieXin Liang
|
d9d35def3d
|
[test] add ut and bm for get_last_loc (#6746)
|
2025-05-29 11:47:21 -07:00 |
|
fzyzcjy
|
6df81e8a39
|
Support tuning DeepEP configs (#6742)
|
2025-05-29 08:12:22 -07:00 |
|
fzyzcjy
|
3ab7d9b55e
|
Support picking variants of EPLB algorithms (#6728)
|
2025-05-29 08:12:01 -07:00 |
|
fzyzcjy
|
7e5071c92a
|
Super tiny enable sole usage of expert distribution metrics and update doc (#6680)
|
2025-05-29 08:11:38 -07:00 |
|
Liangsheng Yin
|
78689d3393
|
PD Rust LB (PO2) (#6437)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
|
2025-05-29 20:50:10 +08:00 |
|
shangmingc
|
1dc6864f17
|
[PD] Support completion endpoint (#6729)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-29 16:26:18 +08:00 |
|
ChangyiYang
|
485a023bd8
|
refactor apply_w8a8_block_fp8_linear in fp (#6545)
|
2025-05-29 00:15:11 -07:00 |
|
Ke Bao
|
7e41290082
|
Add draft extend CUDA graph for Triton backend (#6705)
|
2025-05-29 00:13:07 -07:00 |
|
Chang Su
|
c673727e0e
|
refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor parse_streaming_increment (#6715)
|
2025-05-29 00:08:45 -07:00 |
|
Xu Wenqing
|
f4d4f93928
|
Add DeepSeek-R1-0528 function call chat template (#6725)
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
|
2025-05-29 00:05:07 -07:00 |
|
Baizhou Zhang
|
f2bd3515fb
|
Tune memory arguments on B200 (#6718)
|
2025-05-29 00:03:22 -07:00 |
|
dongmao zhang
|
c459536b0f
|
[PD] bug fix: Update status if nixl receiver send a dummy req. (#6720)
|
2025-05-29 00:01:56 -07:00 |
|
JieXin Liang
|
535c838674
|
[fix] more mem for draft_extend cuda_graph (#6726)
|
2025-05-28 23:25:18 -07:00 |
|
JieXin Liang
|
2163586e63
|
[feat] triton kernel for get_last_loc (#6676)
|
2025-05-28 23:10:28 -07:00 |
|
iLeGend
|
e06b076105
|
Fix PP for Qwen3 MoE (#6709)
|
2025-05-28 23:06:18 -07:00 |
|
Wenxuan Tan
|
844a8f42c7
|
Fix LoRA bench (#6719)
|
2025-05-28 16:38:55 -07:00 |
|
Baizhou Zhang
|
791b3bfabb
|
[Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (#6479)
|
2025-05-28 16:03:43 -07:00 |
|
fzyzcjy
|
31589e177e
|
Speed up when having padding tokens two-batch overlap (#6668)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
|
2025-05-28 16:00:58 -07:00 |
|
fzyzcjy
|
ae6a5b2950
|
Minor refactor two-batch overlap (#6682)
|
2025-05-28 15:54:17 -07:00 |
|
fzyzcjy
|
4839999b76
|
Overlap two kernels in DeepSeek with communication (#6711)
|
2025-05-28 15:53:51 -07:00 |
|
fzyzcjy
|
541a985f85
|
Fuse routed_scaling_factor in DeepSeek (#6710)
|
2025-05-28 15:53:37 -07:00 |
|
Hongbo Xu
|
5170b010a6
|
[PD] Remove Unnecessary Exception Handling for FastQueue.get() (#6712)
|
2025-05-28 11:18:24 -07:00 |
|
shangmingc
|
d63e76f735
|
[CI] Fix setup of disaggregation with different tp (#6706)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 11:17:27 -07:00 |
|
shangmingc
|
e9fd11c0d1
|
[Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set (#6703)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 21:33:36 +08:00 |
|
shangmingc
|
c7588d593e
|
[Bugfix] Fix slice operation when chunk size mismatch (#6697)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 21:15:00 +08:00 |
|