Commit Graph

392 Commits

Author SHA1 Message Date
Xinyuan Tong
12d6cf18f0 Refactors radix cache for extra key support (#10317)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-09-22 02:16:16 +08:00
Baizhou Zhang
8ecef73f12 [1/2] Support deterministic inference with flashinfer attention backend (#10645)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-19 23:34:29 -07:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
penguin_wwy
93f75778be [RL] Add destroy process group api (#9979)
2025-09-19 00:31:56 +08:00
Xuchun Shang
1ccd59c715 [HICache] introduce evict policy (#10190)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
2025-09-18 11:10:20 +08:00
harrisonlimh
14fdd52740 feat: add priority based scheduling with priority based request acceptance and preemption (#8746)
2025-09-16 17:10:10 -07:00
Yingchun Lai
b1721edbac [PD metrics] Add latency Histogram metrics of each stage for generate requests (#8710)
2025-09-16 01:52:49 +08:00
Mick
0549f21c60 fix: fix max_new_tokens uninitialized error (#9343)
2025-09-15 12:06:55 +08:00
Liangsheng Yin
305c9e8c2d [4/N]DP refactor: support watching mode get_load and shortest queue strategy (#10201)
2025-09-15 10:06:08 +08:00
Feng Su
4c21b09074 [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 (#9962)
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
2025-09-15 02:08:02 +08:00
Yingchun Lai
21ca4c3afa [PD metrics] Fix some uncompleted PD related metrics (#8627)
2025-09-14 02:26:58 -07:00
Sundara Raman Ramachandran
a360511d7b [Generative Score API] Scoring(Prefill-only) optimizations. (#9748)
2025-09-14 01:57:06 +08:00
amysaq2023
30d20ce84f Support loading weights from remote instance (#8215)
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Co-authored-by: Chayenne <74843776+zhaochenyang20@users.noreply.github.com>
2025-09-12 17:40:22 +08:00
Yi Zhang
30c6e1f569 Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
2025-09-11 04:11:49 -07:00
DarkSharpness
948b01a04c [Refactor] Remove Hicache Load & Write threads (#10127)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-08 22:18:50 -07:00
Cao E
7577f0e40f Add graph runner support with torch compile on CPU (#7843)
2025-09-07 21:33:58 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090)
2025-09-07 20:55:09 -07:00
Yuwei An
9a7ced4e4d [Feature] LMCache Connector Integration (#9741)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-06 20:14:55 -07:00
pansicheng
f84db115b1 Add storage read/write bandwidth logs to monitor kvcache performance (#9965)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-05 16:52:55 -07:00
fzyzcjy
df97b31f37 Tiny support setting numa nodes for different ranks (#10006)
2025-09-05 19:01:27 +08:00
limingshu
afd9f2f560 Fix typo in scheduler (#9934)
2025-09-05 17:45:27 +08:00
Huang Long
f98366604b fix MultiTokenizerWrapper name (#10049)
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
2025-09-05 13:39:46 +08:00
Liangsheng Yin
27e8ffed37 [1/N] DP-refactor: move dp balance code into scheduler's mixin class (#10004)
2025-09-04 16:53:58 +08:00
Lianmin Zheng
60e37f8028 Move parsers under a single folder (#9912)
2025-09-02 18:25:04 -07:00
huangtingwei
cb9e0e4180 [HiCacheStorage] fix abort request host memory leaks (#9874)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-01 18:59:29 -07:00
ybyang
5f77e1292d Support Multi Process Tokenizer Manager(#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-09-01 01:00:13 -07:00
Teng Ma
f05c68733e [HiCache] Clear kvcache in storage backend with fastAPI (#9750)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2025-08-31 17:41:44 +08:00
Liangsheng Yin
836873b99f Fix memory leak when aborting decode request in PD-Disagg (#9817)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
2025-08-30 14:36:03 +08:00
Zhiqiang Xie
54e872d343 [HiCache] resolve conflict between chunked-prefill and hicache hit count (#9776)
2025-08-30 01:30:54 +08:00
hzh0425
c04c17edfa refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management (#9555)
Co-authored-by: Teng Ma <805522925@qq.com>
2025-08-26 17:55:20 -07:00
Zhiqiang Xie
43de1d7304 HiCache Storage fix host memory leak (#9648)
2025-08-26 10:49:40 -07:00
Sundara Raman Ramachandran
ea0696b924 [Performance] Batch Send from Tokenizer Manager. (#9436)
2025-08-26 01:43:54 +08:00
Chanh Nguyen
127d4b0d5e Support GC Freezing to improve latency & throughput (#9241)
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2025-08-23 13:43:09 +08:00
datdo-msft
98b44e9e56 [PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's (#8936)
2025-08-18 14:23:46 -07:00
Shangming Cai
384f8ab5ce [PD] Support PD disaggregation with Prefill PP (#8846)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: root <huzhiyuan@xiaohongshu.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: zitto <zhjc1124@gmail.com>
2025-08-16 18:31:31 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849)
2025-08-14 21:14:53 -07:00
Sundara Raman Ramachandran
a027a9b4b3 [Generative Score API] Optimization to Remove Decode. (#8840)
2025-08-14 05:12:24 +08:00
Lianmin Zheng
9e426466af Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-13 13:56:04 -07:00
Baizhou Zhang
75e6a7cde1 Support radix cache for Lora feature (#7216)
2025-08-11 10:14:11 -07:00
DarkSharpness
19bc77f05c [Fix] Fix hicache backend (#8991)
2025-08-09 17:16:25 -07:00
Zilin Zhu
dd650e0e21 [RL] fix skip_server_warmup and rl health_generate logic (#8757)
2025-08-08 04:34:38 -07:00
Lianmin Zheng
a947154286 Revert "Support Multi Process Tokenizer Manager" (#8960)
2025-08-08 02:28:27 -07:00
pansicheng
e2fd2b9c7e Simple prefetch policy (#8692)
2025-08-08 02:09:28 -07:00
ybyang
7490e3f67d Support Multi Process Tokenizer Manager (#6555)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: lw9527 <952799980@qq.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
2025-08-08 01:45:50 -07:00
fzyzcjy
774b47f3f1 Reduce scheduler recv requests overhead (#8947)
2025-08-08 00:10:05 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697)
2025-08-06 19:39:45 -07:00
Baizhou Zhang
f2d68ded6d Rename lora_path to lora_id in batches (#8437)
2025-08-03 21:08:28 -07:00
ybyang
6f9baf1002 [Improvements] Merge health check route (#8444)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-08-03 01:59:06 -07:00
Guanhua Wang
f7b2853ff8 [feat] support minimum token load balance in dp attention (#7379)
2025-08-03 00:46:47 -07:00
Wenxuan Tan
0305c5053f Reduce memory accumulation in long-running server (#8306)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2025-08-03 15:03:16 +08:00