Lianmin Zheng | b2ccf36d4d | Fix memory leak during abort (#2238) | 2024-11-28 02:22:15 -08:00 |
Ying Sheng | 37c8a5761f | [feat] Support session control for vision language models (#2210) | 2024-11-27 00:03:29 -08:00 |
Rin Intachuen | 1aea19f64b | Input_embeds support (#2052) | 2024-11-25 16:35:04 -08:00 |
Lianmin Zheng | 731146f6cb | Fix mixed chunked prefill in overlap mode (#2158) | 2024-11-24 07:17:37 -08:00 |
Lianmin Zheng | 5652c56535 | Update CI threshold & Improve code style (#2159) | 2024-11-24 06:29:38 -08:00 |
Lianmin Zheng | c211e7b669 | Simplify batch update (#2154) | 2024-11-24 04:47:10 -08:00 |
Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
Ying Sheng | 5942dfc00a | [feat] Add session control (#2073) | 2024-11-20 00:36:53 -08:00 |
Lianmin Zheng | 7d671e4ad2 | Enable overlap by default (#2067) | 2024-11-19 22:07:58 -08:00 |
Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
Lianmin Zheng | b7a065eae3 | Use cuda event wait and synchronization instead of busy waiting (#2089) | 2024-11-19 00:21:46 -08:00 |
Lianmin Zheng | b110453802 | Simplify logits penalizer (#2086) | 2024-11-18 17:48:28 -08:00 |
Lianmin Zheng | 116685337e | Fix cuda illegal memory access in overlap mode (#2070) | 2024-11-17 21:29:30 -08:00 |
Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
Ke Bao | 62832bb272 | Support cuda graph for DP attention (#2061) | 2024-11-17 16:29:20 -08:00 |
Lianmin Zheng | edad373135 | Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data (#2051) | 2024-11-16 16:14:23 -08:00 |
Ke Bao | 976bc302e5 | Support DP MLA (#1970) | 2024-11-16 09:01:43 +00:00 |
HAI | e5c6715003 | Fix core (MI300X) with --enable-overlap (#2048) | 2024-11-15 21:24:42 -08:00 |
Lianmin Zheng | 54479d6f30 | Fix grammar backend for tensor parallelism (#2020) | 2024-11-13 01:49:45 -08:00 |
Lianmin Zheng | ba069a24d3 | Fix grammar backend (#2018) | 2024-11-12 21:17:38 -08:00 |
Lianmin Zheng | 78c1d6445f | Fix finish reason (#2013) | 2024-11-11 23:24:41 -08:00 |
Lianmin Zheng | befc6beb86 | Fix a typo in io_struct.py (#2008) | 2024-11-11 16:34:10 -08:00 |
yizhang2077 | a8aad9357d | qwen2vl fix bug for #1971 #1897 (#1984) | 2024-11-10 08:10:45 -08:00 |
Lianmin Zheng | 1929c06762 | Simplify prometheus metrics (#1981); Co-authored-by: Mohit Reddy <mohitreddy1996@users.noreply.github.com> | 2024-11-10 04:39:32 -08:00 |
Chayenne | c77c1e05ba | fix black in pre-commit (#1940) | 2024-11-08 07:42:47 +08:00 |
Lzhang-hub | a146d9990e | support prometheus metrics (#1853); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>; Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> | 2024-11-05 20:42:53 -08:00 |
Liangsheng Yin | b9fd178f1b | Fix retraction + overlap (#1860); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-10-31 18:27:42 -07:00 |
Lianmin Zheng | a2e0424abf | Fix memory leak for chunked prefill 2 (#1858); Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> | 2024-10-31 14:51:51 -07:00 |
DarkSharpness | b77a02cdfd | [Performance] Support both xgrammar and outlines for constrained decoding (#1752) | 2024-10-25 21:47:02 +00:00 |
Lianmin Zheng | c555ce2ca2 | Revert "Fix memory leak when doing chunked prefill" (#1797) | 2024-10-25 10:24:44 -07:00 |
Liangsheng Yin | a2f5e7555f | Fix memory leak when doing chunked prefill (#1787) | 2024-10-25 08:01:17 -07:00 |
Lianmin Zheng | 8f8f96a621 | Fix the perf regression due to additional_stop_token_ids (#1773) | 2024-10-23 16:45:21 -07:00 |
Liangsheng Yin | 3f5ac88d02 | Fix out of memory message. (#1771) | 2024-10-23 15:20:39 -07:00 |
Liangsheng Yin | 94cde10920 | Llama3.2 vision model support (#1551) | 2024-10-21 15:01:21 -07:00 |
Lianmin Zheng | 09603c6dc9 | Maintain seq_lens_sum to make more FlashInfer operations non-blocking (#1741) | 2024-10-21 01:43:16 -07:00 |
Lianmin Zheng | b121bc03a3 | Simplify batch result resolution (#1735) | 2024-10-20 19:47:14 -07:00 |
Lianmin Zheng | e12358dc91 | Simplify the usage of device (#1734) | 2024-10-20 18:17:41 -07:00 |
Lianmin Zheng | b48edff67f | Split the overlapped version of TpModelWorkerClient into a separate file (#1726) | 2024-10-20 00:29:29 -07:00 |
Lianmin Zheng | 59cbf47626 | Unify the memory pool api and tp worker API (#1724) | 2024-10-19 23:19:26 -07:00 |
Yineng Zhang | cbbc82b7b8 | Support qwen2 vl model (#1721); Co-authored-by: yizhang2077 <1109276519@qq.com>; Co-authored-by: ispobock <ISPObaoke@163.com> | 2024-10-19 21:44:38 -07:00 |
Lianmin Zheng | 769bf11c05 | Fix the race condition in overlap mode (#1712) | 2024-10-19 06:50:56 -07:00 |
Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
Lianmin Zheng | 2bcfba1b08 | Skip unnecessary penalizer (#1707) | 2024-10-18 17:54:03 -07:00 |
Lianmin Zheng | 392f2863c8 | Add dtype for more operations (#1705) | 2024-10-18 12:18:15 -07:00 |
Lianmin Zheng | 6d0fa73ece | Simplify flashinfer utilities (#1704) | 2024-10-17 22:54:14 -07:00 |
havetc | ecb8bad276 | Returning a per request metric for number of cached_tokens read (#1599) | 2024-10-16 11:49:22 -07:00 |
Lianmin Zheng | 9116b2896f | Add a new event loop (#1677) | 2024-10-16 01:33:20 -07:00 |
Liangsheng Yin | b6b4094621 | Fix filter_batch function call (#1681) | 2024-10-15 22:59:26 -07:00 |
Lianmin Zheng | 4a292f670d | [Minor] Add some utility functions (#1671) | 2024-10-14 20:08:03 -07:00 |
Lianmin Zheng | 24f3e1511c | [Minor] Improve style (#1666) | 2024-10-14 05:25:00 -07:00 |