Author | Commit | Message | Date
DarkSharpness | e273aa6dcf | [Feature] Radix Tree in C++ (#7369) | 2025-08-02 19:50:14 -07:00
Cheng Wan | 6c88f6c8d9 | [5/N] MoE Refactor: Update MoE parallelism arguments (#8658) | 2025-08-01 01:20:03 -07:00
Zhiqiang Xie | dd7ca00601 | Interface change for kvcache io to support page first layout (#8318) | 2025-08-01 11:37:49 +08:00
Cheng Wan | 7a1f7fc504 | [Feature] Hybrid EP and TP (#8590) | 2025-07-31 02:53:25 -07:00
hzh0425 | 2fbb754e1d | feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (#8516); Co-authored-by: Shangming Cai <csmthu@gmail.com>; Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> | 2025-07-29 21:19:25 -07:00
Lianmin Zheng | a4c3b121d8 | Split the scheduler into multiple mixin classes to reduce the file size (#8483) | 2025-07-29 12:46:50 -07:00
fzyzcjy | 0ce84c822b | Support colocating requests (#7973) | 2025-07-28 22:51:49 -07:00
harrisonlimh | 747dd45077 | feat: throttle requests at scheduler based on --max_queued_requests (#7565) | 2025-07-28 22:32:33 +08:00
Shangming Cai | 2fd5c7049f | [PD] Fix abort_request for PD disaggregation (#8352); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>; Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> | 2025-07-27 21:48:27 -07:00
Chang Su | dd487e5553 | bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (#8422) | 2025-07-28 10:01:02 +08:00
Yineng Zhang | f8ca2368b2 | fix: kimi k2 xgrammar crash (#8367); Co-authored-by: cicirori <32845984+cicirori@users.noreply.github.com>; Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com> | 2025-07-25 15:44:01 -07:00
Lianmin Zheng | ed2e313eb6 | Clean up server_args, triton cache manager (#8332) | 2025-07-25 14:14:51 -07:00
Lifu Huang | 8abd3e77fe | Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (#8261) | 2025-07-23 00:32:16 -07:00
Lianmin Zheng | 55381a46ac | Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (#8181) | 2025-07-19 22:41:30 -07:00
ybyang | 4540a4666a | [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (#8115); Signed-off-by: ybyang <ybyang7@iflytek.com> | 2025-07-19 18:10:00 -07:00
Lianmin Zheng | bb0e8a32b5 | Clean up server args (#8161) | 2025-07-19 11:32:52 -07:00
Sai Enduri | d0510f08fe | Revert "Fix different device type adjustment in PP" (#8141) | 2025-07-18 01:12:11 -07:00
Zhiqiang Xie | 9d33fcfb8e | Hicache Storage Layer Prototype (#7704) | 2025-07-18 15:20:19 +08:00
Zhao Chen | 3586b4cef2 | feat: add production metric for retracted requests due to insufficient kvcache (#7030); Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com> | 2025-07-17 11:59:05 -07:00
Yingchun Lai | 795668dc73 | feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (#7597); Co-authored-by: Stefan He <hebiaobuaa@gmail.com> | 2025-07-16 17:55:59 -07:00
Qiaolin Yu | 69f453e5a4 | Use device_group for all_gather when disabling overlap scheduling (#8001) | 2025-07-15 19:38:58 -07:00
Qiaolin Yu | 3bc43c683e | Fix different device type adjustment in PP (#7760) | 2025-07-15 19:37:14 -07:00
Hanming Lu | 9379da77de | SWA Prefix Cache (#7367); Co-authored-by: Ying Sheng <sqy1415@gmail.com> | 2025-07-13 12:31:07 -07:00
kyleliang-nv | dd445a41f5 | [feature] Add start step profile argument in /start_profile (#7608) | 2025-07-09 18:42:15 -07:00
Zhiqiang Xie | 2fc824b84c | Kernels for efficient KV cache IO (#7313) | 2025-07-06 22:53:36 -07:00
yuhsuan-t | 8d4a01cbd7 | Log the timestamps of each prefill/decode iteration (#6094); Co-authored-by: yuhsuan-t <12108766+yuhsaun-t@users.noreply.github.com> | 2025-07-07 01:57:27 +00:00
Nan Jiang | ba69c153f6 | [RL]: Fix error tagging in multi-stage wake up (#7812); Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> | 2025-07-06 16:51:29 -07:00
Stefan He | 3589aa79b0 | [RL] Fix illegal memory for _import_static_state (#7733); Co-authored-by: nanjiangwill <willjiang2018@gmail.com> | 2025-07-06 16:25:21 -07:00
Cheng Wan | 8fc910db03 | DP Attention with Auto DeepEP Dispatch (#7222) | 2025-07-05 01:54:24 -07:00
Lianmin Zheng | 14229ccf8f | Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748) | 2025-07-04 16:33:33 -07:00
TianyuZhang1214 | 0099172327 | feat: use D2D instead of H2H in pp (#7673); Co-authored-by: alpha-baby <fujianhao1997@qq.com> | 2025-07-03 10:58:50 -07:00
Chunyuan WU | 1dce6c480f | [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (#6771) | 2025-07-03 09:51:38 -07:00
Ziming Huang | 1bebd3154e | Fix num_tokens_pre_allocated in disaggregation log (#7714) | 2025-07-02 22:31:49 -07:00
Albert | d3c275b117 | Support updating weights at once by stopping all requests (#6698); Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>; Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> | 2025-07-02 22:26:06 -07:00
Zilin Zhu | 0626f678de | [RL] support update_weights_from_distributed with different group and multiple weights (#7292) | 2025-07-02 19:29:11 -07:00
Lianmin Zheng | 22352d47a9 | Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632); Co-authored-by: Kan Wu <wukanustc@gmail.com> | 2025-06-29 23:16:19 -07:00
fzyzcjy | 0c9c6c75a8 | Move files related to EPLB (#7580) | 2025-06-29 15:39:38 -07:00
Lifu Huang | 49538d111b | Support dynamic LoRA loading / unloading in engine/server API (#7446) | 2025-06-27 21:00:27 -07:00
tarinkk | eb6c2c1663 | Hybrid kv cache for LLaMA4 (#6563); Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>; Co-authored-by: tarinkk <rt572@physics.rutger.edu>; Co-authored-by: tarinkk <rt572@rutgers.physics.edu>; Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> | 2025-06-27 18:58:55 -07:00
Stefan He | 00fbd8a484 | Fix typo of flash_cache (#7513) | 2025-06-25 02:04:41 -07:00
zixuanzhang226 | f3cbd24541 | feat: send kvmetrics from sglang scheduler (#6721) | 2025-06-25 01:57:49 -07:00
DangKai | bc2e5645c4 | fix: force synchronization between TP workers when update_weights (#6626); Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> | 2025-06-25 01:35:59 -07:00
u4lr451 | ed0a0b692c | Perormance: Enable cuda graph for dp idle batch (#7269); Co-authored-by: austindeng <austindeng@tencent.com>; Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>; Co-authored-by: ch-wan <cwan39@gatech.edu> | 2025-06-23 17:34:13 -07:00
Lianmin Zheng | 55e03b10c4 | Fix a bug in BatchTokenIDOut & Misc style and dependency updates (#7457) | 2025-06-23 06:20:39 -07:00
fzyzcjy | edc21cc8ae | Tiny add logging for GC (#7406) | 2025-06-22 12:40:02 +08:00
Liangsheng Yin | 05c9bc8956 | [minor] simplify the TokenToKVPoolAllocator (#7414) | 2025-06-22 12:37:18 +08:00
Cheng Wan | 5041df2d01 | Fix 7285 Merge Conflicts (#7403) | 2025-06-20 16:02:50 -07:00
Cheng Wan | 73b13e69b4 | Optimize DP attn scheduling for speculative decoding (#7285) | 2025-06-20 15:06:41 -07:00
Cheng Wan | e879d8b7a8 | [Feature] Comprehensive Hybrid Parallelism Support (#6389) | 2025-06-20 14:43:11 -07:00
strgrb | ceba0ce4f6 | support return logprobs for pipeline (#7356); Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> | 2025-06-19 23:50:45 -07:00