832 Commits

Author SHA1 Message Date
Lianmin Zheng
0d800090b4 Fix missing additional_stop_token_ids (#1769) 2024-10-23 12:18:59 -07:00
Lianmin Zheng
80a905475d Fix stop condition for <|eom_id|> (#1766) 2024-10-23 10:47:12 -07:00
Lianmin Zheng
9af7b88e3c [Fix] Fix abort in dp (#1767) 2024-10-23 10:46:29 -07:00
Lianmin Zheng
fbcbb26327 Fix perf regression for set_kv_buffer (#1765) 2024-10-23 09:57:08 -07:00
Ying Sheng
2fce449b1c [API] add get memory pool size (#1760) 2024-10-23 07:02:29 +00:00
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Lianmin Zheng
ad4125d1a9 Fuse more ops & Simplify token mapping (#1758) 2024-10-22 23:20:43 -07:00
Byron Hsu
17536e7e3d Fix edge case for truncated (#1747) 2024-10-23 00:00:25 -04:00
Lianmin Zheng
1f26e8b8e4 Release v0.3.4.post1 (#1749) 2024-10-21 21:16:43 -07:00
Liangsheng Yin
5e1558f1f2 Update max_req_len and max_req_input_len (#1748) 2024-10-21 16:12:04 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
Lianmin Zheng
00611286a1 Fix sliding window attention and gemma-2 unit tests in CI (#1746) 2024-10-21 13:47:12 -07:00
Lianmin Zheng
7ce3606891 Faster overlap mode scheduler (#1738) 2024-10-21 04:30:52 -07:00
Liangsheng Yin
efb099cdee Fix prefill oom (#1743) 2024-10-21 03:54:35 -07:00
Lianmin Zheng
09603c6dc9 Maintain seq_lens_sum to make more FlashInfer operations non-blocking (#1741) 2024-10-21 01:43:16 -07:00
Lianmin Zheng
cf470fea32 Make token mapping non-blocking in the overlapped mode (#1740) 2024-10-20 23:25:14 -07:00
sixgod
45d5af2416 Add GLM-4 TextGeneration Model support for SGLang (#1736) 2024-10-21 04:08:30 +00:00
Lianmin Zheng
b121bc03a3 Simplify batch result resolution (#1735) 2024-10-20 19:47:14 -07:00
Lianmin Zheng
e12358dc91 Simplify the usage of device (#1734) 2024-10-20 18:17:41 -07:00
yizhang2077
554fbf93cd [Bugfix] qwen2vl forward_extend (#1727) 2024-10-20 02:38:35 -07:00
Lianmin Zheng
b48edff67f Split the overlapped version of TpModelWorkerClient into a separate file (#1726) 2024-10-20 00:29:29 -07:00
Lianmin Zheng
59cbf47626 Unify the memory pool api and tp worker API (#1724) 2024-10-19 23:19:26 -07:00
Yineng Zhang
cbbc82b7b8 Support qwen2 vl model (#1721) 2024-10-19 21:44:38 -07:00
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: ispobock <ISPObaoke@163.com>
Yineng Zhang
8bee20f80b Update vllm to 0.6.3 (#1711) (#1720) 2024-10-19 20:45:41 -07:00
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Lianmin Zheng
12cad0feae Simplify the interface of tp_worker (#1718) 2024-10-19 17:39:38 -07:00
Lianmin Zheng
b6cd903604 Update readme and workflow (#1716) 2024-10-19 13:01:44 -07:00
Lianmin Zheng
087257ea03 Release v0.3.4 (#1714) 2024-10-19 08:17:41 -07:00
Lianmin Zheng
769bf11c05 Fix the race condition in overlap mode (#1712) 2024-10-19 06:50:56 -07:00
Lianmin Zheng
3db43d1b08 Fix is_all_ready for overlap copy (#1710) 2024-10-18 21:01:52 -07:00
Lianmin Zheng
f0f8a7699b Simplify the nan detection and greedy check in sampler (#1709) 2024-10-18 20:21:24 -07:00
Lianmin Zheng
2bcfba1b08 Skip unnecessary penalizer (#1707) 2024-10-18 17:54:03 -07:00
Lianmin Zheng
bc12d4033f Add grouped free operations (#1706) 2024-10-18 13:21:05 -07:00
Lianmin Zheng
392f2863c8 Add dtype for more operations (#1705) 2024-10-18 12:18:15 -07:00
Lianmin Zheng
6d0fa73ece Simplify flashinfer utilities (#1704) 2024-10-17 22:54:14 -07:00
Liangsheng Yin
9e0dac1ad7 Fix regex and logprob conflicts when chunked prefilling (#1703) 2024-10-17 18:33:21 -07:00
Gleb Drozdov
a95d5589c3 Add matched_stop token or str to distinguish between eos or stop str finish_reason generation (#1684) 2024-10-17 18:06:52 +00:00
Lianmin Zheng
d17d19e5b8 Fix mixed batch for multi modal models (#1702) 2024-10-17 10:27:26 -07:00
Lianmin Zheng
dd3809fad8 Fix engine unit test (#1701) 2024-10-17 09:53:32 -07:00
Lianmin Zheng
7feba41584 Fix failed ci tests on long prompts; Better error messages for embedding models (#1700) 2024-10-17 09:23:29 -07:00
Michael Feil
e5db40dcbc ORJson. Faster Json serialization (#1694) 2024-10-17 08:03:08 -07:00
wxsm
b170930534 feat: radix tree code optimize (#1697) 2024-10-17 08:01:27 -07:00
Jani Monoses
5ab20cceba Use SGLang imports for linear layer (#1696) 2024-10-17 07:50:01 -07:00
Lianmin Zheng
02f7f3e488 Update the transformers version in CI (#1690) 2024-10-16 19:03:55 -07:00
Zeng Zhongchao
2782132be8 Add date to logging messages (#1623) (#1679) 2024-10-16 18:54:55 -07:00
Michael Feil
b0facb3316 add orjson for jsonresponse (#1688) 2024-10-16 18:14:30 -07:00
havetc
ecb8bad276 Returning a per request metric for number of cached_tokens read (#1599) 2024-10-16 11:49:22 -07:00
Lianmin Zheng
dbec2f1847 Launch a thread to overlap CPU and GPU (#1687) 2024-10-16 11:20:17 -07:00
Ke Bao
d10b933a36 Fix srt dependency (#1685) 2024-10-16 08:21:20 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Jani Monoses
a5114b6f91 Add OLMo model (#1676) 2024-10-16 00:11:18 -07:00
Liangsheng Yin
b6b4094621 Fix filter_batch function call (#1681) 2024-10-15 22:59:26 -07:00