131 Commits

Author SHA1 Message Date
Zhiqiang Xie
54e872d343 [HiCache] resolve conflict between chunked-prefill and hicache hit count (#9776) 2025-08-30 01:30:54 +08:00
Xuchun Shang
e5b29bf14e [PD] Support get_model_info interface for mini_lb (#9792)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
2025-08-29 00:54:03 -07:00
chenxu140
74dd4249ac [Feature] Support NPUGraph for DeepSeek on Ascend NPU (#9355)
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
2025-08-28 16:06:24 -07:00
SCDESPERTATE
b5c6529e17 [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats (#7317) 2025-08-24 23:16:43 -07:00
Shangming Cai
25ef53f05f [PD] Fix nvlink transport accuracy through transferring metadata with tcp (#9261)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-08-20 19:29:10 -07:00
fzyzcjy
fe43e889f8 Fix mini lb timeout issue (#9369) 2025-08-19 20:15:16 -07:00
chenxu140
01d47a27b6 [Bugfix] fix kv buffer register & dp attention & deepepmoe (#9327) 2025-08-19 10:09:48 -07:00
datdo-msft
98b44e9e56 [PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's (#8936) 2025-08-18 14:23:46 -07:00
Shangming Cai
384f8ab5ce [PD] Support PD disaggregation with Prefill PP (#8846)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: root <huzhiyuan@xiaohongshu.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: zitto <zhjc1124@gmail.com>
2025-08-16 18:31:31 -07:00
Lianmin Zheng
9e426466af Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-13 13:56:04 -07:00
Teng Ma
4a16a71c36 [PD] feat: mooncake use batch reg/dereg (#8910)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-08-13 09:54:34 -07:00
Francis
a16923efab [PD] optimize kv cache transfer directly using batch transfer (#9149)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-08-13 09:54:14 -07:00
Liangsheng Yin
f9afa7dceb Fix docs for clip max new tokens (#9082) 2025-08-11 13:15:21 -07:00
Jimmy
0d9e89ec69 [PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866) 2025-08-11 13:08:11 -07:00
PGFLMG
b7cd743038 [Feat] QWen-1M context support[2/2]: Update block sparse attention backend (#5949) 2025-08-06 23:49:36 -07:00
Shangming Cai
d98a4913ea [PD] Refactor parallel sizes and add pp support for mooncake (#8571)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-08-04 20:18:11 -07:00
ybyang
6f9baf1002 [Improvements] Merge health check route (#8444)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-08-03 01:59:06 -07:00
萝卜菜
2d401bd99d [fix] fix pd disagg error of vlms (#8094) 2025-08-02 02:16:29 +08:00
Simo Lin
5c14515fec [bug] remove pdlb from minilb since its no longer available (#8634) 2025-07-31 13:54:02 -07:00
Shangming Cai
016fd25127 [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (#8595)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-07-31 21:29:34 +08:00
Lianmin Zheng
a4c3b121d8 Split the scheduler into multiple mixin classes to reduce the file size (#8483) 2025-07-29 12:46:50 -07:00
Shangming Cai
2fd5c7049f [PD] Fix abort_request for PD disaggregation (#8352)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
2025-07-27 21:48:27 -07:00
Shangming Cai
22e00eeb4a [Bugfix] Prevent PD server crash from invalid grammar (#8062)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-07-28 00:17:51 +08:00
Stepan Kargaltsev
1b9cea5ade [P/D] Support ipv6 in P/D scenario (#7858)
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-07-25 08:53:30 -07:00
Shangming Cai
1403ea5694 [PD] Support non-MLA models PD different TP with DP attention (#7931)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-07-18 22:00:49 -07:00
Hanming Lu
9379da77de SWA Prefix Cache (#7367)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-07-13 12:31:07 -07:00
fzyzcjy
c46e069d34 Tiny fix mooncake log warning wrong output (#7952) 2025-07-12 21:22:44 -07:00
fzyzcjy
880221bd3b Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (#7236)" (#7968) 2025-07-11 19:03:01 -07:00
ronnie_zheng
86044712c6 [feature] kv transfer support of ascend npu (#7795)
Co-authored-by: liupeng <liupeng374@huawei.com>
2025-07-11 00:07:51 -07:00
almaslof
f9df11ae86 Remove unused imports (#7898) 2025-07-09 22:36:48 +08:00
Shangming Cai
64c5907e12 [PD] Add guidance for prefill bootstrap timeout (#7846)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-07-08 21:00:34 -07:00
Cheng Wan
8fc910db03 DP Attention with Auto DeepEP Dispatch (#7222) 2025-07-05 01:54:24 -07:00
Caproni
af5647748a [Fix] Alloc return type error (#7778)
Signed-off-by: Capronir <839972205@qq.com>
2025-07-04 19:00:40 -07:00
Ziming Huang
1bebd3154e Fix num_tokens_pre_allocated in disaggregation log (#7714) 2025-07-02 22:31:49 -07:00
Shangming Cai
eb429b88a4 [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated (#7598)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-27 22:22:01 -07:00
tarinkk
eb6c2c1663 Hybrid kv cache for LLaMA4 (#6563)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-06-27 18:58:55 -07:00
Trevor Morris
bb9b608c86 [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND (#7330) 2025-06-26 10:39:39 -07:00
Shangming Cai
5c2142579a [PD] Raise error for incompatible mooncake version and some minor fixes (#7527)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-25 18:55:24 -07:00
eigen
20beb3702b feat: add return hidden_states at async generation (#7507) 2025-06-25 02:10:09 -07:00
Hongbo Xu
e21aa1df67 [PD] Add different TP sizes support for no-MLA models (#6793)
Co-authored-by: shangmingc <csmthu@gmail.com>
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-25 02:00:22 -07:00
linzhuo
afeed46530 clean duplicate code (#7512) 2025-06-25 01:22:20 -07:00
Shangming Cai
7b9a174a7a [PD][Spec] Fix hidden state transfer for spec decode (#7516)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-25 00:42:07 -07:00
Trevor Morris
5f527834a8 [PD] NIXL: Register kv args in advance and cleanup finished requests (#6717) 2025-06-24 11:26:09 -07:00
Francis
2ed68d7a6c [PD Disaggregation] replace transfer with batch transfer for better performance (#7236) 2025-06-24 02:12:04 -07:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Atream
02bf31ef29 [fix] PD disaggregation when enable mtp and tp!=dp (#7420) 2025-06-21 12:03:11 -07:00
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
Shangming Cai
187b85b7f3 [PD] Optimize custom mem pool usage and bump mooncake version (#7393)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-20 09:50:39 -07:00
Shangming Cai
f88e70853e [Bugfix][PD] Set conclude state before clear when failure happens (#7362)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-06-19 11:26:53 -07:00
Atream
4f838c09cd [PD] Transfer hidden states for mtp when disaggregation (#7242) 2025-06-19 11:22:47 -07:00