Jianan Ji
|
22630ca242
|
Support sliding window in triton backend (#6509)
|
2025-05-30 01:11:53 -07:00 |
|
Yuhong Guo
|
d279d4990c
|
Fix aiohttp 'Chunk too big' in bench_serving (#6737)
|
2025-05-30 00:50:36 -07:00 |
|
shangmingc
|
6cb00c6398
|
[PD] Optimize time out logic and add env var doc for mooncake (#6761)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-30 00:45:02 -07:00 |
|
fzyzcjy
|
2c3b71d678
|
Improve EPLB logical to physical dispatch map (#6727)
|
2025-05-29 19:23:54 -07:00 |
|
Zilin Zhu
|
51cdd81f97
|
[fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265)
|
2025-05-29 16:28:10 -07:00 |
|
Baizhou Zhang
|
73def253b5
|
Fix mem_fraction_static for AMD CI (#6748)
|
2025-05-29 12:37:30 -07:00 |
|
fzyzcjy
|
3ab7d9b55e
|
Support picking variants of EPLB algorithms (#6728)
|
2025-05-29 08:12:01 -07:00 |
|
fzyzcjy
|
7e5071c92a
|
Super tiny enable sole usage of expert distribution metrics and update doc (#6680)
|
2025-05-29 08:11:38 -07:00 |
|
Liangsheng Yin
|
78689d3393
|
PD Rust LB (PO2) (#6437)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
|
2025-05-29 20:50:10 +08:00 |
|
shangmingc
|
1dc6864f17
|
[PD] Support completion endpoint (#6729)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-29 16:26:18 +08:00 |
|
ChangyiYang
|
485a023bd8
|
refactor apply_w8a8_block_fp8_linear in fp (#6545)
|
2025-05-29 00:15:11 -07:00 |
|
Ke Bao
|
7e41290082
|
Add draft extend CUDA graph for Triton backend (#6705)
|
2025-05-29 00:13:07 -07:00 |
|
Chang Su
|
c673727e0e
|
refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor parse_streaming_increment (#6715)
|
2025-05-29 00:08:45 -07:00 |
|
Baizhou Zhang
|
f2bd3515fb
|
Tune memory arguments on B200 (#6718)
|
2025-05-29 00:03:22 -07:00 |
|
dongmao zhang
|
c459536b0f
|
[PD] bug fix: Update status if nixl receiver send a dummy req. (#6720)
|
2025-05-29 00:01:56 -07:00 |
|
JieXin Liang
|
535c838674
|
[fix] more mem for draft_extend cuda_graph (#6726)
|
2025-05-28 23:25:18 -07:00 |
|
JieXin Liang
|
2163586e63
|
[feat] triton kernel for get_last_loc (#6676)
|
2025-05-28 23:10:28 -07:00 |
|
iLeGend
|
e06b076105
|
Fix PP for Qwen3 MoE (#6709)
|
2025-05-28 23:06:18 -07:00 |
|
Baizhou Zhang
|
791b3bfabb
|
[Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (#6479)
|
2025-05-28 16:03:43 -07:00 |
|
fzyzcjy
|
31589e177e
|
Speed up when having padding tokens two-batch overlap (#6668)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
|
2025-05-28 16:00:58 -07:00 |
|
fzyzcjy
|
ae6a5b2950
|
Minor refactor two-batch overlap (#6682)
|
2025-05-28 15:54:17 -07:00 |
|
fzyzcjy
|
4839999b76
|
Overlap two kernels in DeepSeek with communication (#6711)
|
2025-05-28 15:53:51 -07:00 |
|
fzyzcjy
|
541a985f85
|
Fuse routed_scaling_factor in DeepSeek (#6710)
|
2025-05-28 15:53:37 -07:00 |
|
Hongbo Xu
|
5170b010a6
|
[PD] Remove Unnecessary Exception Handling for FastQueue.get() (#6712)
|
2025-05-28 11:18:24 -07:00 |
|
shangmingc
|
e9fd11c0d1
|
[Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set (#6703)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 21:33:36 +08:00 |
|
shangmingc
|
c7588d593e
|
[Bugfix] Fix slice operation when chunk size mismatch (#6697)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 21:15:00 +08:00 |
|
ybyang
|
6b231325b9
|
[PD Perf] replace Queue to FastQueue (#6649)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 01:37:51 -07:00 |
|
shangmingc
|
b1c8d4e9f3
|
[PD] Abort unbootstrapped prefill requests through timeout (#6685)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 00:40:54 -07:00 |
|
shangmingc
|
fba03b29e3
|
[Bugfix] Fix missing abort finish reason for PD with ChatCompletion (#6693)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2025-05-28 00:39:46 -07:00 |
|
Chang Su
|
461a730280
|
fix(deepseekv3): Fix DeepSeekV3Detector tool_index assignment and multi-tool call streaming support (#6655)
|
2025-05-28 00:22:53 -07:00 |
|
Yuan Luo
|
c087ddd686
|
Refine pre_reorder_triton_kernel slightly to improve performance (#6627)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-05-28 00:15:23 -07:00 |
|
Chang Su
|
41ba767f0c
|
feat: Add warnings for invalid tool_choice and UTs (#6582)
|
2025-05-27 16:53:19 -07:00 |
|
Chang Su
|
bdb962d755
|
fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming (#6678)
|
2025-05-27 16:18:42 -07:00 |
|
fzyzcjy
|
87068b5cc7
|
Support gathering expert distribution details (#6665)
|
2025-05-27 15:32:59 -07:00 |
|
fzyzcjy
|
a564e001b5
|
Fix DeepEP error in Qwen 3 MoE models (#6673)
|
2025-05-27 15:12:54 -07:00 |
|
Trevor Morris
|
e806f708c9
|
[PD] Make bootstrap code common between NIXL and Mooncake (#6473)
|
2025-05-27 12:47:38 -07:00 |
|
Yineng Zhang
|
fa6723f08f
|
Revert "fix communicator for non-dp lm head (#6662)" (#6677)
|
2025-05-27 12:22:59 -07:00 |
|
fzyzcjy
|
673ff668f7
|
Speed up expert location update (#6661)
|
2025-05-27 10:00:09 -07:00 |
|
fzyzcjy
|
447be24228
|
Fix OOM when updating expert locations (#6660)
|
2025-05-27 09:59:53 -07:00 |
|
HAI
|
183d9f969c
|
DeepSeek: enable none block-quant FP8 quantizations (#6638)
|
2025-05-27 09:06:40 -07:00 |
|
Ke Bao
|
631950280a
|
Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
|
2025-05-27 02:35:17 -07:00 |
|
Cheng Wan
|
a3d7f4b673
|
fix communicator for non-dp lm head (#6662)
|
2025-05-27 02:31:12 -07:00 |
|
Yi Zhang
|
b18416fbf8
|
Fix qwen3 tbo/dp-lm-head (#6652)
|
2025-05-27 00:38:27 -07:00 |
|
Mick
|
ce9d690ef4
|
fix: fix nightly test from updating transformers (#6658)
|
2025-05-27 00:28:11 -07:00 |
|
Baizhou Zhang
|
bdaefbbfbd
|
Add environment flag for disabling message queue broadcaster (#6403)
|
2025-05-26 22:32:41 -07:00 |
|
Chang Su
|
ae33584235
|
[Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py (#6650)
|
2025-05-26 21:57:10 -07:00 |
|
Lifu Huang
|
477a101cbd
|
Refactor LoRA handling to support adapter tensors in fused format (#6585)
|
2025-05-26 21:51:54 -07:00 |
|
fzyzcjy
|
1a8f5f6836
|
Super tiny rename environment variable (#6648)
|
2025-05-26 21:01:16 -07:00 |
|
fzyzcjy
|
32cd707002
|
Support TP in attention for two batch overlap (#6634)
|
2025-05-26 20:28:12 -07:00 |
|
fzyzcjy
|
ebd1ed49d4
|
Tiny refactor communicator (#6646)
|
2025-05-26 20:24:17 -07:00 |
|