134 Commits

Author SHA1 Message Date
Vishwanath Venkatesan
2cd2e27f80 SGLang HiCache NIXL Connector (#8488)
Signed-off-by: Vishwanath Venkatesan <vvenkatesan@nvidia.com>
Co-authored-by: Moein Khazraee <moein@nvidia.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-07-31 13:09:42 -07:00
Ke Bao
3c307dc057 Fix hf3fs_fuse import error (#8623) 2025-07-31 22:42:31 +08:00
huangtingwei
d904959233 Support l3 cache (mooncake store) for hiradix cache (#7211)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: zuoyuan <zhangzuo21@mails.tsinghua.edu.cn>
Co-authored-by: @wangyueneng.wyn <wangyueneng.wyn@antgroup.com>
Co-authored-by: JinYan Su <jinyansu792@gmail.com>
2025-07-30 23:15:51 -07:00
huangtingwei
26c8a310bd fix incorrect increase of hit count (#8533)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-07-31 06:02:42 +00:00
yi wang
5963e50503 [bugfix] Fix 2 minor bugs in the hicache storage layer (#8404) 2025-07-31 05:47:14 +00:00
pansicheng
299803343d Add hf3fs support for hicache storage (based on #7704) (#7280)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-07-30 17:42:41 -07:00
hzh0425
a85ebf50b8 feat(hicache): support file backend reading directory config form env. (#8498) 2025-07-29 21:18:46 -07:00
Zhiqiang Xie
528bd1ed85 HiCache, check before terminate prefetching (#8372) 2025-07-26 23:13:16 -07:00
Zhiqiang Xie
145482f422 HiCache Storage TP Refinement (#8307)
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-07-25 08:31:47 +08:00
YiXR
a99801e075 [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (#8133)
Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Co-authored-by: Xingrui Yi <yixingrui@linux.alibaba.com>
2025-07-23 13:28:12 -07:00
Zhiqiang Xie
9d33fcfb8e Hicache Storage Layer Prototype (#7704) 2025-07-18 15:20:19 +08:00
Ziqi Fan
01857fab61 fix: update HostKVCache init to report correct msg when available memory is not enough (#8102) 2025-07-17 21:24:34 +08:00
hzh0425
7c39e8a198 Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (#7982) 2025-07-14 14:57:10 -07:00
Hanming Lu
9379da77de SWA Prefix Cache (#7367)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-07-13 12:31:07 -07:00
Ying Sheng
bcc5ba94b4 [minor fix] SWA missing methods (#7972) 2025-07-11 23:57:02 -07:00
Ying Sheng
cee9f329c4 [minor fix] llama4 hybrid memory (#7950) 2025-07-11 23:11:36 -07:00
ronnie_zheng
86044712c6 [feature] kv transfer support of ascend npu (#7795)
Co-authored-by: liupeng <liupeng374@huawei.com>
2025-07-11 00:07:51 -07:00
ronnie_zheng
766392c6bd [feature]Ascend quantization support (#7791)
Co-authored-by: ichernob <ichernobnn@gmail.com>
Co-authored-by: liupeng <liupeng374@huawei.com>
2025-07-10 09:17:37 -07:00
Zhiqiang Xie
2fc824b84c Kernels for efficient KV cache IO (#7313) 2025-07-06 22:53:36 -07:00
Lianmin Zheng
14229ccf8f Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748) 2025-07-04 16:33:33 -07:00
ronnie_zheng
1e0e549766 Ascend attention backend(PA&MLA) (#7722)
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
2025-07-03 09:23:19 -07:00
Lianmin Zheng
22352d47a9 Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-06-29 23:16:19 -07:00
tarinkk
eb6c2c1663 Hybrid kv cache for LLaMA4 (#6563)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-06-27 18:58:55 -07:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Liangsheng Yin
5ea5d22170 Fix CPU offloading for MLA memory pool (#7409) 2025-06-22 02:39:05 +08:00
Shangming Cai
187b85b7f3 [PD] Optimize custom mem pool usage and bump mooncake version (#7393)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-20 09:50:39 -07:00
DarkSharpness
47367b768d [Refactor] Clean up radix cache related API (#7303)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-06-20 00:58:48 +08:00
Stefan He
3774f07825 Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) 2025-06-19 00:56:37 -07:00
Zhiqiang Xie
e56685ac1b Upstreaming hicache bug fixes (#7267) 2025-06-17 17:44:57 -07:00
shangmingc
c26d7349d3 [PD] Add custom memory pool option to support Mooncake PD with NVLink (#7264)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-17 17:21:37 -07:00
Lianmin Zheng
b1286a116a [EAGLE] Refactor code for page size > 1 & more simplifications (#7213) 2025-06-16 03:04:29 -07:00
Baizhou Zhang
d2679f5109 Fix ChunkCache object has no attribute 'disable' (#7217) 2025-06-15 20:55:15 -07:00
Lianmin Zheng
fff10809bf Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210) 2025-06-15 02:48:00 -07:00
Lianmin Zheng
5f1ab32717 [EAGLE] Refactor code for page size > 1 & more simplifications (#7163) 2025-06-14 23:16:23 -07:00
Lianmin Zheng
38af4f68a9 Fix grammar abort & Minor style fixes (#7204) 2025-06-14 22:49:41 -07:00
Lianmin Zheng
a6305c7d50 Lianmin/simplify memory pool (#7202) 2025-06-14 22:25:37 -07:00
Lianmin Zheng
a023856b12 Move host memory pools into a separate file (#7200) 2025-06-14 21:31:42 -07:00
Byron Hsu
db0cc57e75 [PD] Support decode retract and update decode.py (#7196) 2025-06-14 19:48:05 -07:00
Faradawn Yang
777688b892 [feat]: Emit fixed-size KV blocks events (#6824) 2025-06-11 13:07:58 -07:00
sogalin
02543b545c Fix misusing the "_is_cuda". (#7091) 2025-06-11 11:21:31 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-22 20:32:41 -07:00
Byron Hsu
0a4fc73b48 [PD] Fix failure abort (#6535) 2025-05-22 20:32:03 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
Trevor Morris
7adf245ba2 [Metrics] Add KV events publishing (#6098) 2025-05-19 14:19:54 -07:00
wangxiyu191
155214952b refactor: Extract repeated member variables in KVCache subclasses to base class. (#6323) 2025-05-18 15:28:15 -07:00
doujiang24
9d24c3ffb0 chore: tiny remove duplicated code (#6392)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
2025-05-18 02:17:32 -07:00
Lifu Huang
3cf1473a09 Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-17 16:49:18 -07:00
Simon (Jiyou) Li
b29a026e14 KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (#6016)
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: chus-chus <chus-chus@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-05-09 15:50:06 -07:00
Zhiqiang Xie
f8e460930a Fix prefill OOM error in the case of large page size (#5081) 2025-05-05 16:02:55 -07:00