Zhiqiang Xie | b26cb1c55a | Fix problem of large page size with chunked prefill (#6046) | 2025-05-06 15:19:47 +08:00 |
Zhiqiang Xie | f8e460930a | Fix prefill OOM error in the case of large page size (#5081) | 2025-05-05 16:02:55 -07:00 |
Lianmin Zheng | c76040e31b | Support page size > 1 (#4356) | 2025-03-12 22:22:39 -07:00 |
Zhiqiang Xie | 10b544ae9b | Hierarchical Caching Refactoring and Fixing TP issue (#4082) | 2025-03-12 11:22:35 -07:00 |
Ying Sheng | d3d4d76758 | [Eagle] Refactor eagle speculative decoding (#3986); Co-authored-by: Ke Bao <ISPObaoke@163.com> | 2025-03-05 08:06:07 -08:00 |
Lianmin Zheng | ac2387279e | Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988); Co-authored-by: SangBin Cho <rkooo567@gmail.com>; Co-authored-by: dhou-xai <dhou@x.ai>; Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu> | 2025-03-03 00:12:04 -08:00 |
Zhiqiang Xie | 8af7048dcf | Query remaining memory dynamically for PrefillAdder (#2941) | 2025-01-17 20:20:26 -08:00 |
Lianmin Zheng | f65c13b559 | Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) | 2025-01-15 04:54:14 -08:00 |
libra | bdb3929dbb | Refactor SchedulePolicy to improve code organization (#2571) | 2025-01-04 00:05:16 +08:00 |
Liangsheng Yin | e7ebecf82e | Fix cache hit rate when chunked prefill (#2555) | 2024-12-26 03:14:28 -08:00 |
Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
SangBin Cho | 9208618b3e | [Core] in batch prefix caching by delay scheduling (#2442) | 2024-12-11 12:51:50 -08:00 |
Liangsheng Yin | 5f12f0e7af | Fix chunked prefill when ignore eos (#2290) | 2024-12-01 00:37:53 -08:00 |
Lianmin Zheng | f5b5f2bff9 | Revert "[Fix] fix assertion error for chunked prefill when disabling cache" (#2286) | 2024-11-30 19:03:42 -08:00 |
Rui Wang | d622851dc9 | [Fix] fix assertion error for chunked prefill when disabling cache (#2282) | 2024-11-30 17:53:43 -08:00 |
Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
Lianmin Zheng | 80e2c4a8de | Fix chunked prefill with output logprob (#2083) | 2024-11-18 13:16:28 -08:00 |
Lianmin Zheng | 1929c06762 | Simplify prometheus metrics (#1981); Co-authored-by: Mohit Reddy <mohitreddy1996@users.noreply.github.com> | 2024-11-10 04:39:32 -08:00 |
Lzhang-hub | a146d9990e | support prometheus metrics (#1853); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>; Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> | 2024-11-05 20:42:53 -08:00 |
Lianmin Zheng | efbc116a0f | Do not use longest prefix matching when #queue-req is large (#1896) | 2024-11-03 01:45:20 -07:00 |
Lianmin Zheng | 2b80978859 | Provide an argument to set the maximum batch size for cuda graph (#1809) | 2024-10-26 15:09:33 -07:00 |
Lianmin Zheng | c555ce2ca2 | Revert "Fix memory leak when doing chunked prefill" (#1797) | 2024-10-25 10:24:44 -07:00 |
Liangsheng Yin | a2f5e7555f | Fix memory leak when doing chunked prefill (#1787) | 2024-10-25 08:01:17 -07:00 |
havetc | ecb8bad276 | Returning a per request metric for number of cached_tokens read (#1599) | 2024-10-16 11:49:22 -07:00 |
Lianmin Zheng | 24f3e1511c | [Minor] Improve style (#1666) | 2024-10-14 05:25:00 -07:00 |
Ke Bao | 68f8b60d22 | Fix chunked prefill condition (#1594) | 2024-10-07 06:34:14 +00:00 |
Lianmin Zheng | 9244f27f0a | [Minor] Improve the style and fix flaky tests (#1584) | 2024-10-06 00:10:48 -07:00 |
Liangsheng Yin | 5d0ba4038f | Refine the add request reasons to avoid corner cases. (#1574) | 2024-10-04 18:00:18 -07:00 |
Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |