29 Commits

Author SHA1 Message Date
Zhiqiang Xie
b26cb1c55a Fix problem of large page size with chunked prefill (#6046) 2025-05-06 15:19:47 +08:00
Zhiqiang Xie
f8e460930a Fix prefill OOM error in the case of large page size (#5081) 2025-05-05 16:02:55 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986) 2025-03-05 08:06:07 -08:00
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988) 2025-03-03 00:12:04 -08:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
Zhiqiang Xie
8af7048dcf Query remaining memory dynamically for PrefillAdder (#2941) 2025-01-17 20:20:26 -08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
libra
bdb3929dbb Refactor SchedulePolicy to improve code organization (#2571) 2025-01-04 00:05:16 +08:00
Liangsheng Yin
e7ebecf82e Fix cache hit rate when chunked prefill (#2555) 2024-12-26 03:14:28 -08:00
Lianmin Zheng
56198b45d9 Add a benchmark script for in-batch prefix caching (#2494) 2024-12-16 18:49:02 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Liangsheng Yin
5f12f0e7af Fix chunked prefill when ignore eos (#2290) 2024-12-01 00:37:53 -08:00
Lianmin Zheng
f5b5f2bff9 Revert "[Fix] fix assertion error for chunked prefill when disabling cache" (#2286) 2024-11-30 19:03:42 -08:00
Rui Wang
d622851dc9 [Fix] fix assertion error for chunked prefill when disabling cache (#2282) 2024-11-30 17:53:43 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Lianmin Zheng
80e2c4a8de Fix chunked prefill with output logprob (#2083) 2024-11-18 13:16:28 -08:00
Lianmin Zheng
1929c06762 Simplify prometheus metrics (#1981) 2024-11-10 04:39:32 -08:00
Co-authored-by: Mohit Reddy <mohitreddy1996@users.noreply.github.com>
Lzhang-hub
a146d9990e support prometheus metrics (#1853) 2024-11-05 20:42:53 -08:00
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Lianmin Zheng
efbc116a0f Do not use longest prefix matching when #queue-req is large (#1896) 2024-11-03 01:45:20 -07:00
Lianmin Zheng
2b80978859 Provide an argument to set the maximum batch size for cuda graph (#1809) 2024-10-26 15:09:33 -07:00
Lianmin Zheng
c555ce2ca2 Revert "Fix memory leak when doing chunked prefill" (#1797) 2024-10-25 10:24:44 -07:00
Liangsheng Yin
a2f5e7555f Fix memory leak when doing chunked prefill (#1787) 2024-10-25 08:01:17 -07:00
havetc
ecb8bad276 Returning a per request metric for number of cached_tokens read (#1599) 2024-10-16 11:49:22 -07:00
Lianmin Zheng
24f3e1511c [Minor] Improve style (#1666) 2024-10-14 05:25:00 -07:00
Ke Bao
68f8b60d22 Fix chunked prefill condition (#1594) 2024-10-07 06:34:14 +00:00
Lianmin Zheng
9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) 2024-10-06 00:10:48 -07:00
Liangsheng Yin
5d0ba4038f Refine the add request reasons to avoid corner cases. (#1574) 2024-10-04 18:00:18 -07:00
Lianmin Zheng
36d5acfca5 Rename InputMetadata -> ForwardBatch (#1543) 2024-09-30 02:41:11 -07:00