Marc Sun | 37f1547587 | [FEAT] Add transformers backend support (#5929) | 2025-06-03 21:05:29 -07:00
Cheng Wan | 8a5480528d | [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) | 2025-06-03 17:48:24 -07:00
fzyzcjy | b6d0ce9f78 | Minor add metrics to expert location updater (#6816) | 2025-06-02 23:59:11 -07:00
fzyzcjy | 0ea330ca34 | Fix wrong weight reference in dynamic EPLB (#6818) | 2025-06-02 23:26:04 -07:00
pansicheng | 27e327b415 | fix new_page_count_next_decode (#6671) | 2025-06-02 22:48:52 -07:00
Pavani Majety | eb38c7d1ca | [1/2] Add Kernel support for Cutlass based Fused FP4 MoE (#6093); Signed-off-by: Pavani Majety <pmajety@nvidia.com> | 2025-06-02 13:48:03 -07:00
fzyzcjy | df7f61ee7d | Speed up rebalancing when using non-static dispatch algorithms (#6812) | 2025-06-02 11:18:17 -07:00
fzyzcjy | ef21729c1d | Fix profiles do not have consistent names (#6811) | 2025-06-02 11:17:22 -07:00
fzyzcjy | f5159315b2 | Add simple utility to dump tensors for debugging (#6815) | 2025-06-02 11:15:31 -07:00
fzyzcjy | 6d7b6696d4 | Tiny fix EPLB assertion about rebalancing period and recorder window size (#6813) | 2025-06-02 11:13:33 -07:00
fzyzcjy | 6376b632eb | Tiny log prefill time (#6780) | 2025-06-02 10:28:27 -07:00
fzyzcjy | e05e29d178 | Refactor CustomOp to avoid confusing bugs (#5382) | 2025-06-02 10:27:36 -07:00
Ke Bao | a2cb5913a0 | Add draft extend CUDA graph for flashinfer backend (#6805) | 2025-06-02 01:51:26 -07:00
Lianmin Zheng | 20fd53b8f6 | Correctly abort the failed grammar requests & Improve the handling of abort (#6803) | 2025-06-01 19:00:07 -07:00
Baizhou Zhang | 6a47b73024 | Remove contiguous before Flashinfer groupwise fp8 gemm (#6804) | 2025-06-01 18:30:54 -07:00
Lifu Huang | 0a9bfc20ab | [Minor] Always append newline after image token when parsing chat message (#6797) | 2025-05-31 20:50:33 -07:00
Yineng Zhang | 34c63731fc | chore: upgrade sgl-kernel v0.1.5 (#6795) | 2025-05-31 18:32:00 -07:00
Lianmin Zheng | 2d72fc47cf | Improve profiler and integrate profiler in bench_one_batch_server (#6787) | 2025-05-31 15:53:55 -07:00
Qiaolin Yu | 7dc0e39442 | Bump torch to 2.7.0 (#6788) | 2025-05-31 14:43:12 -07:00
Yikai Zhang | fb507b7b10 | [FIX] mmmu bench serving result display error (#6525) (#6791) | 2025-05-31 13:48:06 -07:00
storyicon | f90945c45a | fix(PD-disaggregation): Can not get local ip (#6792); Signed-off-by: storyicon <storyicon@foxmail.com> | 2025-05-31 13:47:14 -07:00
Lifu Huang | 094fbdacd5 | Fix incorrect LoRA weight loading for fused gate_up_proj (#6734) | 2025-05-31 13:41:44 -07:00
YanbingJiang | 888cb175a6 | Add intel_amx backend for Radix Attention for CPU (#6408); Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>; Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> | 2025-05-30 21:37:42 -07:00
Cheng Wan | ced3c07afe | Support token-level quantization for EP MoE (#6782) | 2025-05-30 17:26:30 -07:00
Chang Su | f18b068f15 | feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream (#6784) | 2025-05-30 17:05:17 -07:00
Chao Yang | 4fac524b14 | update llama4 chat template and pythonic parser (#6679); Co-authored-by: Chang Su <chang.s.su@oracle.com> | 2025-05-30 17:01:22 -07:00
Cheng Wan | b581b22504 | Fix one bug in the grouped-gemm triton kernel (#6772) | 2025-05-30 01:42:08 -07:00
Li Hui | 69dd878b51 | Fix shared experts fusion error (#6289) | 2025-05-30 01:16:11 -07:00
Jianan Ji | 22630ca242 | Support sliding window in triton backend (#6509) | 2025-05-30 01:11:53 -07:00
Yuhong Guo | d279d4990c | Fix aiohttp 'Chunk too big' in bench_serving (#6737) | 2025-05-30 00:50:36 -07:00
shangmingc | 6cb00c6398 | [PD] Optimize time out logic and add env var doc for mooncake (#6761); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-05-30 00:45:02 -07:00
fzyzcjy | 2c3b71d678 | Improve EPLB logical to physical dispatch map (#6727) | 2025-05-29 19:23:54 -07:00
Zilin Zhu | 51cdd81f97 | [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265) | 2025-05-29 16:28:10 -07:00
Baizhou Zhang | 73def253b5 | Fix mem_fraction_static for AMD CI (#6748) | 2025-05-29 12:37:30 -07:00
fzyzcjy | 3ab7d9b55e | Support picking variants of EPLB algorithms (#6728) | 2025-05-29 08:12:01 -07:00
fzyzcjy | 7e5071c92a | Super tiny enable sole usage of expert distribution metrics and update doc (#6680) | 2025-05-29 08:11:38 -07:00
Liangsheng Yin | 78689d3393 | PD Rust LB (PO2) (#6437); Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> | 2025-05-29 20:50:10 +08:00
shangmingc | 1dc6864f17 | [PD] Support completion endpoint (#6729); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-05-29 16:26:18 +08:00
ChangyiYang | 485a023bd8 | refactor apply_w8a8_block_fp8_linear in fp (#6545) | 2025-05-29 00:15:11 -07:00
Ke Bao | 7e41290082 | Add draft extend CUDA graph for Triton backend (#6705) | 2025-05-29 00:13:07 -07:00
Chang Su | c673727e0e | refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor parse_streaming_increment (#6715) | 2025-05-29 00:08:45 -07:00
Baizhou Zhang | f2bd3515fb | Tune memory arguments on B200 (#6718) | 2025-05-29 00:03:22 -07:00
dongmao zhang | c459536b0f | [PD] bug fix: Update status if nixl receiver send a a dummy req. (#6720) | 2025-05-29 00:01:56 -07:00
JieXin Liang | 535c838674 | [fix] more mem for draft_extend cuda_graph (#6726) | 2025-05-28 23:25:18 -07:00
JieXin Liang | 2163586e63 | [feat] triton kernel for get_last_loc (#6676) | 2025-05-28 23:10:28 -07:00
iLeGend | e06b076105 | Fix PP for Qwen3 MoE (#6709) | 2025-05-28 23:06:18 -07:00
Baizhou Zhang | 791b3bfabb | [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (#6479) | 2025-05-28 16:03:43 -07:00
fzyzcjy | 31589e177e | Speed up when having padding tokens two-batch overlap (#6668); Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> | 2025-05-28 16:00:58 -07:00
fzyzcjy | ae6a5b2950 | Minor refactor two-batch overlap (#6682) | 2025-05-28 15:54:17 -07:00
fzyzcjy | 4839999b76 | Overlap two kernels in DeepSeek with communication (#6711) | 2025-05-28 15:53:51 -07:00