Commit Graph

329 Commits

Author SHA1 Message Date
Lianmin Zheng
55381a46ac Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (#8181) 2025-07-19 22:41:30 -07:00
ybyang
4540a4666a [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (#8115)
Signed-off-by: ybyang <ybyang7@iflytek.com>
2025-07-19 18:10:00 -07:00
Lianmin Zheng
bb0e8a32b5 Clean up server args (#8161) 2025-07-19 11:32:52 -07:00
Sai Enduri
d0510f08fe Revert "Fix different device type adjustment in PP" (#8141) 2025-07-18 01:12:11 -07:00
Zhiqiang Xie
9d33fcfb8e Hicache Storage Layer Prototype (#7704) 2025-07-18 15:20:19 +08:00
Zhao Chen
3586b4cef2 feat: add production metric for retracted requests due to insufficient kvcache (#7030)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
2025-07-17 11:59:05 -07:00
Yingchun Lai
795668dc73 feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (#7597)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-07-16 17:55:59 -07:00
Qiaolin Yu
69f453e5a4 Use device_group for all_gather when disabling overlap scheduling (#8001) 2025-07-15 19:38:58 -07:00
Qiaolin Yu
3bc43c683e Fix different device type adjustment in PP (#7760) 2025-07-15 19:37:14 -07:00
Hanming Lu
9379da77de SWA Prefix Cache (#7367)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-07-13 12:31:07 -07:00
kyleliang-nv
dd445a41f5 [feature] Add start step profile argument in /start_profile (#7608) 2025-07-09 18:42:15 -07:00
Zhiqiang Xie
2fc824b84c Kernels for efficient KV cache IO (#7313) 2025-07-06 22:53:36 -07:00
yuhsuan-t
8d4a01cbd7 Log the timestamps of each prefill/decode iteration (#6094)
Co-authored-by: yuhsuan-t <12108766+yuhsaun-t@users.noreply.github.com>
2025-07-07 01:57:27 +00:00
Nan Jiang
ba69c153f6 [RL]: Fix error tagging in multi-stage wake up (#7812)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-07-06 16:51:29 -07:00
Stefan He
3589aa79b0 [RL] Fix illegal memory for _import_static_state (#7733)
Co-authored-by: nanjiangwill <willjiang2018@gmail.com>
2025-07-06 16:25:21 -07:00
Cheng Wan
8fc910db03 DP Attention with Auto DeepEP Dispatch (#7222) 2025-07-05 01:54:24 -07:00
Lianmin Zheng
14229ccf8f Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748) 2025-07-04 16:33:33 -07:00
TianyuZhang1214
0099172327 feat: use D2D instead of H2H in pp (#7673)
Co-authored-by: alpha-baby <fujianhao1997@qq.com>
2025-07-03 10:58:50 -07:00
Chunyuan WU
1dce6c480f [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (#6771) 2025-07-03 09:51:38 -07:00
Ziming Huang
1bebd3154e Fix num_tokens_pre_allocated in disaggregation log (#7714) 2025-07-02 22:31:49 -07:00
Albert
d3c275b117 Support updating weights at once by stopping all requests (#6698)
Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>
2025-07-02 22:26:06 -07:00
Zilin Zhu
0626f678de [RL] support update_weights_from_distributed with different group and multiple weights (#7292) 2025-07-02 19:29:11 -07:00
Lianmin Zheng
22352d47a9 Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-06-29 23:16:19 -07:00
fzyzcjy
0c9c6c75a8 Move files related to EPLB (#7580) 2025-06-29 15:39:38 -07:00
Lifu Huang
49538d111b Support dynamic LoRA loading / unloading in engine/server API (#7446) 2025-06-27 21:00:27 -07:00
tarinkk
eb6c2c1663 Hybrid kv cache for LLaMA4 (#6563)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-06-27 18:58:55 -07:00
Stefan He
00fbd8a484 Fix typo of flash_cache (#7513) 2025-06-25 02:04:41 -07:00
zixuanzhang226
f3cbd24541 feat: send kvmetrics from sglang scheduler (#6721) 2025-06-25 01:57:49 -07:00
DangKai
bc2e5645c4 fix: force synchronization between TP workers when update_weights (#6626)
Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>
2025-06-25 01:35:59 -07:00
u4lr451
ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-23 17:34:13 -07:00
Lianmin Zheng
55e03b10c4 Fix a bug in BatchTokenIDOut & Misc style and dependency updates (#7457) 2025-06-23 06:20:39 -07:00
fzyzcjy
edc21cc8ae Tiny add logging for GC (#7406) 2025-06-22 12:40:02 +08:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Cheng Wan
5041df2d01 Fix 7285 Merge Conflicts (#7403) 2025-06-20 16:02:50 -07:00
Cheng Wan
73b13e69b4 Optimize DP attn scheduling for speculative decoding (#7285) 2025-06-20 15:06:41 -07:00
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
strgrb
ceba0ce4f6 support return logprobs for pipeline (#7356)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
2025-06-19 23:50:45 -07:00
Huang Long
1d6515ef2a [Bugfix]Fix hang bug using dp attention with HiRadixCache (#7159)
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
2025-06-19 20:34:36 -07:00
Atream
4f838c09cd [PD] Transfer hidden states for mtp when disaggregation (#7242) 2025-06-19 11:22:47 -07:00
DarkSharpness
47367b768d [Refactor] Clean up radix cache related API (#7303)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-06-20 00:58:48 +08:00
Stefan He
3774f07825 Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) 2025-06-19 00:56:37 -07:00
fzyzcjy
9c6a0656a3 Fix profiler error when there are idle passes (#7003) 2025-06-18 10:55:01 -07:00
Zhiqiang Xie
e56685ac1b Upstreaming hicache bug fixes (#7267) 2025-06-17 17:44:57 -07:00
shangmingc
c26d7349d3 [PD] Add custom memory pool option to support Mooncake PD with NVLink (#7264)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-17 17:21:37 -07:00
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-17 00:33:28 -07:00
woodx
e30ef368ab Feat/support rerank (#6058) 2025-06-16 10:50:01 -07:00
Liangsheng Yin
c494386728 minor fix (#7245) 2025-06-16 23:30:26 +08:00
Byron Hsu
88f9c347b2 [PD] use int32 for kv indices & get num_reserved_decode_tokens from server_args (#7214) 2025-06-15 11:51:03 -07:00
Lianmin Zheng
38af4f68a9 Fix grammar abort & Minor style fixes (#7204) 2025-06-14 22:49:41 -07:00
Byron Hsu
db0cc57e75 [PD] Support decode retract and update decode.py (#7196) 2025-06-14 19:48:05 -07:00