Lianmin Zheng | 55381a46ac | 2025-07-19 22:41:30 -07:00
Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (#8181)

ybyang | 4540a4666a | 2025-07-19 18:10:00 -07:00
[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (#8115)
Signed-off-by: ybyang <ybyang7@iflytek.com>

Lianmin Zheng | bb0e8a32b5 | 2025-07-19 11:32:52 -07:00
Clean up server args (#8161)

Sai Enduri | d0510f08fe | 2025-07-18 01:12:11 -07:00
Revert "Fix different device type adjustment in PP" (#8141)

Zhiqiang Xie | 9d33fcfb8e | 2025-07-18 15:20:19 +08:00
Hicache Storage Layer Prototype (#7704)

Zhao Chen | 3586b4cef2 | 2025-07-17 11:59:05 -07:00
feat: add production metric for retracted requests due to insufficient kvcache (#7030)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>

Yingchun Lai | 795668dc73 | 2025-07-16 17:55:59 -07:00
feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (#7597)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>

Qiaolin Yu | 69f453e5a4 | 2025-07-15 19:38:58 -07:00
Use device_group for all_gather when disabling overlap scheduling (#8001)

Qiaolin Yu | 3bc43c683e | 2025-07-15 19:37:14 -07:00
Fix different device type adjustment in PP (#7760)

Hanming Lu | 9379da77de | 2025-07-13 12:31:07 -07:00
SWA Prefix Cache (#7367)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>

kyleliang-nv | dd445a41f5 | 2025-07-09 18:42:15 -07:00
[feature] Add start step profile argument in /start_profile (#7608)

Zhiqiang Xie | 2fc824b84c | 2025-07-06 22:53:36 -07:00
Kernels for efficient KV cache IO (#7313)

yuhsuan-t | 8d4a01cbd7 | 2025-07-07 01:57:27 +00:00
Log the timestamps of each prefill/decode iteration (#6094)
Co-authored-by: yuhsuan-t <12108766+yuhsaun-t@users.noreply.github.com>

Nan Jiang | ba69c153f6 | 2025-07-06 16:51:29 -07:00
[RL]: Fix error tagging in multi-stage wake up (#7812)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>

Stefan He | 3589aa79b0 | 2025-07-06 16:25:21 -07:00
[RL] Fix illegal memory for _import_static_state (#7733)
Co-authored-by: nanjiangwill <willjiang2018@gmail.com>

Cheng Wan | 8fc910db03 | 2025-07-05 01:54:24 -07:00
DP Attention with Auto DeepEP Dispatch (#7222)

Lianmin Zheng | 14229ccf8f | 2025-07-04 16:33:33 -07:00
Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748)

TianyuZhang1214 | 0099172327 | 2025-07-03 10:58:50 -07:00
feat: use D2D instead of H2H in pp (#7673)
Co-authored-by: alpha-baby <fujianhao1997@qq.com>

Chunyuan WU | 1dce6c480f | 2025-07-03 09:51:38 -07:00
[CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (#6771)

Ziming Huang | 1bebd3154e | 2025-07-02 22:31:49 -07:00
Fix num_tokens_pre_allocated in disaggregation log (#7714)

Albert | d3c275b117 | 2025-07-02 22:26:06 -07:00
Support updating weights at once by stopping all requests (#6698)
Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>

Zilin Zhu | 0626f678de | 2025-07-02 19:29:11 -07:00
[RL] support update_weights_from_distributed with different group and multiple weights (#7292)

Lianmin Zheng | 22352d47a9 | 2025-06-29 23:16:19 -07:00
Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)
Co-authored-by: Kan Wu <wukanustc@gmail.com>

fzyzcjy | 0c9c6c75a8 | 2025-06-29 15:39:38 -07:00
Move files related to EPLB (#7580)

Lifu Huang | 49538d111b | 2025-06-27 21:00:27 -07:00
Support dynamic LoRA loading / unloading in engine/server API (#7446)

tarinkk | eb6c2c1663 | 2025-06-27 18:58:55 -07:00
Hybrid kv cache for LLaMA4 (#6563)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>

Stefan He | 00fbd8a484 | 2025-06-25 02:04:41 -07:00
Fix typo of flash_cache (#7513)

zixuanzhang226 | f3cbd24541 | 2025-06-25 01:57:49 -07:00
feat: send kvmetrics from sglang scheduler (#6721)

DangKai | bc2e5645c4 | 2025-06-25 01:35:59 -07:00
fix: force synchronization between TP workers when update_weights (#6626)
Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>

u4lr451 | ed0a0b692c | 2025-06-23 17:34:13 -07:00
Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

Lianmin Zheng | 55e03b10c4 | 2025-06-23 06:20:39 -07:00
Fix a bug in BatchTokenIDOut & Misc style and dependency updates (#7457)

fzyzcjy | edc21cc8ae | 2025-06-22 12:40:02 +08:00
Tiny add logging for GC (#7406)

Liangsheng Yin | 05c9bc8956 | 2025-06-22 12:37:18 +08:00
[minor] simplify the TokenToKVPoolAllocator (#7414)

Cheng Wan | 5041df2d01 | 2025-06-20 16:02:50 -07:00
Fix 7285 Merge Conflicts (#7403)

Cheng Wan | 73b13e69b4 | 2025-06-20 15:06:41 -07:00
Optimize DP attn scheduling for speculative decoding (#7285)

Cheng Wan | e879d8b7a8 | 2025-06-20 14:43:11 -07:00
[Feature] Comprehensive Hybrid Parallelism Support (#6389)

strgrb | ceba0ce4f6 | 2025-06-19 23:50:45 -07:00
support return logprobs for pipeline (#7356)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

Huang Long | 1d6515ef2a | 2025-06-19 20:34:36 -07:00
[Bugfix]Fix hang bug using dp attention with HiRadixCache (#7159)
Signed-off-by: huanglong <huanglong@linux.alibaba.com>

Atream | 4f838c09cd | 2025-06-19 11:22:47 -07:00
[PD] Transfer hidden states for mtp when disaggregation (#7242)

DarkSharpness | 47367b768d | 2025-06-20 00:58:48 +08:00
[Refactor] Clean up radix cache related API (#7303)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

Stefan He | 3774f07825 | 2025-06-19 00:56:37 -07:00
Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099)

fzyzcjy | 9c6a0656a3 | 2025-06-18 10:55:01 -07:00
Fix profiler error when there are idle passes (#7003)

Zhiqiang Xie | e56685ac1b | 2025-06-17 17:44:57 -07:00
Upstreaming hicache bug fixes (#7267)

shangmingc | c26d7349d3 | 2025-06-17 17:21:37 -07:00
[PD] Add custom memory pool option to support Mooncake PD with NVLink (#7264)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

u4lr451 | 10d60cd41b | 2025-06-17 00:33:28 -07:00
feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

woodx | e30ef368ab | 2025-06-16 10:50:01 -07:00
Feat/support rerank (#6058)

Liangsheng Yin | c494386728 | 2025-06-16 23:30:26 +08:00
minor fix (#7245)

Byron Hsu | 88f9c347b2 | 2025-06-15 11:51:03 -07:00
[PD] use int32 for kv indices & get num_reserved_decode_tokens from server_args (#7214)

Lianmin Zheng | 38af4f68a9 | 2025-06-14 22:49:41 -07:00
Fix grammar abort & Minor style fixes (#7204)

Byron Hsu | db0cc57e75 | 2025-06-14 19:48:05 -07:00
[PD] Support decode retract and update decode.py (#7196)