62 Commits

Author SHA1 Message Date
Lianmin Zheng
9cd9dc83b3 Temporarily disable unit test of torch native attention backend (#2492) 2024-12-16 14:17:27 -08:00
Ying Sheng
8586b72da0 [feat] Enable chunked prefill for llava-onevision (#2412) 2024-12-09 09:52:38 -08:00
Lianmin Zheng
641b7d0ae0 [Minor] Improve code style (#2422) 2024-12-09 06:30:35 -08:00
Xiaoyu Zhang
3844feb9bb Add a unittest for fused_moe (#2416) 2024-12-08 22:46:10 -08:00
Ying Sheng
aa47f64223 Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) 2024-12-02 23:11:13 -08:00
Ying Sheng
480e38a733 [feat] Enable chunked prefill for llava-onevision (#2281) 2024-12-02 20:19:02 -08:00
Qun Yang
62c516ac45 Add a simple torch native attention backend (#2241) 2024-12-01 03:01:25 -08:00
Lianmin Zheng
9449a95431 [CI] Balance CI tests (#2293) 2024-12-01 01:47:30 -08:00
Lianmin Zheng
0303ca918f [CI] Fix missing files in run_suite.py (#2288) 2024-11-30 23:53:34 -08:00
Lianmin Zheng
4936be8acc Revert "Revert "[FEAT] Support GGUF format"" (#2287) 2024-11-30 22:14:48 -08:00
Lianmin Zheng
7e4c6dd8da Revert "[FEAT] Support GGUF format" (#2285) 2024-11-30 19:03:26 -08:00
Yang Zheng
883c955489 [FEAT] Support GGUF format (#2215)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
2024-11-30 00:44:48 -08:00
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Lianmin Zheng
b2ccf36d4d Fix memory leak during abort (#2238) 2024-11-28 02:22:15 -08:00
Ying Sheng
37c8a5761f [feat] Support session control for vision language models (#2210) 2024-11-27 00:03:29 -08:00
Lianmin Zheng
1605ae121e [CI] Minor fix for CI (#2187) 2024-11-25 16:38:43 -08:00
Rin Intachuen
1aea19f64b Input_embeds support (#2052) 2024-11-25 16:35:04 -08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Lianmin Zheng
254fd130e2 [CI] Split test cases in CI for better load balancing (#2180) 2024-11-25 04:58:16 -08:00
Lianmin Zheng
7d671e4ad2 Enable overlap by default (#2067) 2024-11-19 22:07:58 -08:00
Lianmin Zheng
c3eac1b010 Fix torch.compile for MoE (#2033) 2024-11-14 01:30:24 -08:00
Lianmin Zheng
9c939a3d8b Clean up metrics code (#1972) 2024-11-09 15:43:20 -08:00
Lianmin Zheng
2ce32db6fb Let reward model take text inputs instead of message lists (#1907)
Co-authored-by: Kyle Corbitt <kyle@corbt.com>
2024-11-03 13:27:12 -08:00
Lianmin Zheng
c17c578108 Simplify tokenizer manager (#1904) 2024-11-03 08:38:26 -08:00
Lianmin Zheng
a2e0424abf Fix memory leak for chunked prefill 2 (#1858)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2024-10-31 14:51:51 -07:00
Ke Bao
c77762d57f Fix Triton decode kernel & ut (#1819) 2024-10-27 10:54:38 -07:00
Lianmin Zheng
e646c5901e Fix logprob in the overlapped mode (#1795) 2024-10-25 11:06:57 -07:00
Lianmin Zheng
40900baea7 [Fix] Fix the log parsing in chunked prefill uni tests (#1794) 2024-10-25 08:31:08 -07:00
Liangsheng Yin
a2f5e7555f Fix memory leak when doing chunked prefill (#1787) 2024-10-25 08:01:17 -07:00
Lianmin Zheng
1701b0db31 Enhance the test case for chunked prefill (#1785) 2024-10-24 21:23:09 -07:00
Lianmin Zheng
00611286a1 Fix sliding window attention and gemma-2 unit tests in CI (#1746) 2024-10-21 13:47:12 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Shuo Yang
061e546313 Support double sparsity (#1459) 2024-10-14 02:00:41 -07:00
Lianmin Zheng
869f1c02c4 Add a test case to test retract (#1662) 2024-10-13 20:32:37 -07:00
Byron Hsu
862cd265e5 [engine] support async and streaming (#1614) 2024-10-11 15:26:25 -07:00
Byron Hsu
551a3a9d38 Provide an offline engine API (#1567) 2024-10-06 20:27:03 -07:00
Ying Sheng
0f4fb19bc8 [Fix, LoRA] fix LoRA with updates in main (#1545) 2024-09-30 10:06:08 -07:00
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
Lianmin Zheng
9ba1f09760 [Fix] Fix logprob and normalized_logprob (#1428) 2024-09-15 06:36:06 -07:00
Ke Bao
33b54e7c40 Add pytorch sampling backend ut (#1425) 2024-09-15 01:15:30 +10:00
Ying Sheng
eb02c1618a [Minor, CI] remove lora test from minimal suite (#1406) 2024-09-12 16:49:50 -07:00
Ying Sheng
712216928f [Feature] Initial support for multi-LoRA serving (#1307) 2024-09-12 16:46:14 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2024-09-09 02:14:25 -07:00
havetc
9935f97b3e [FEAT] JSON constrained support (#1125)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-26 09:37:26 -07:00
Mingyi
97589a60a2 [CI] Parallelize unit tests in CI (#1219) 2024-08-26 04:54:02 +00:00
Mingyi
7514b9f8d3 [CI] Fix CI (#1217) 2024-08-26 02:56:42 +00:00
Ying Sheng
308d024092 [CI] Fix the issue of unit test hanging (#1211) 2024-08-25 16:21:37 -07:00
Ying Sheng
ab4990e4bf [Minor] Temporarily skip flaky test (#1209) 2024-08-25 14:49:23 -07:00
Chayenne
30b4f771b0 Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-25 10:29:12 -07:00