Commit Graph

774 Commits

Author  SHA1  Message  Date
Lianmin Zheng  0c1e87964b  Move filter_batch out of stream_output (#1663)  2024-10-14 01:15:34 -07:00
Lianmin Zheng  869f1c02c4  Add a test case to test retract (#1662)  2024-10-13 20:32:37 -07:00
Ying Sheng  2725f8da61  [Minor] Rename no_eos_trim to no_stop_trim (#1661)  2024-10-13 20:30:03 -07:00
Lianmin Zheng  da1ffed689  Add output_ids into ScheduleBatch (#1659)  2024-10-13 19:54:02 -07:00
Ying Sheng  4876117171  [Fix] fix eos trim inconsistency (#1650)  2024-10-13 01:07:09 -07:00
Lianmin Zheng  7ee6c259ff  Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652)  2024-10-12 21:35:30 -07:00
Lianmin Zheng  9610fcd469  Fix the batch_is_full check for jump-forward decoding (#1654)  2024-10-12 19:47:24 -07:00
Patrick Yi  31fad29ab0  Add get_tokenizer function for Engine class (#1653)  2024-10-12 19:39:35 -07:00
Lianmin Zheng  9da5a60b18  Add an option to disable penalizer (#1651)  2024-10-12 17:53:23 -07:00
Lianmin Zheng  69aa937aa5  Fix unit tests and type annotations (#1648)  2024-10-12 14:49:24 -07:00
Zhang, Liangang  5d638c92f5  [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480)  2024-10-12 18:10:32 +00:00
Lianmin Zheng  e37cdab0c6  Fix ignore_eos (#1645)  2024-10-12 00:36:28 -07:00
LI MOU  1d9deeacdb  fix missing ignore_eos in v1/chat/completions (#1642)  2024-10-11 21:37:20 -07:00
Byron Hsu  862cd265e5  [engine] support async and streaming (#1614)  2024-10-11 15:26:25 -07:00
Lianmin Zheng  00c7e6368b  Release v0.3.3.post1 (#1636)  2024-10-11 07:56:16 -07:00
Lianmin Zheng  23cc66f7b6  Add back data parallelism (#1635)  2024-10-11 07:22:48 -07:00
Lianmin Zheng  5d09ca5735  Fix constrained decoding (#1634)  2024-10-11 06:26:20 -07:00
Lianmin Zheng  f13d86f920  Add image_token in conversation.py (#1632) (Co-authored-by: yizhang2077 <1109276519@qq.com>)  2024-10-11 05:07:51 -07:00
Lianmin Zheng  aba9eae4c6  Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py (#1631)  2024-10-11 05:03:20 -07:00
科英  bbd72bfc86  Add the ability to enable and disable the Profiler via HTTP API. (#1626)  2024-10-11 02:34:25 -07:00
Yiding-Lu  b503881bd2  [Bug] Fix the Image Input of Batch Generation (#1579)  2024-10-11 02:25:04 -07:00
glen-amd  58093b868f  Nit about the decorator of PortArgs.init_new (#1611)  2024-10-11 02:17:47 -07:00
Zhang, Liangang  8275049ce3  Add device support (#1607)  2024-10-11 02:05:58 -07:00
HAI  e11ab79e68  [Performance, hardware] MoE tuning update to AMD MI300x GPUs (#1619)  2024-10-10 22:48:15 -07:00
Byron Hsu  01fdb2f377  Fix test_vision_openai_server on CI (#1620)  2024-10-10 16:34:13 -07:00
Amos You  c996e8ccd4  [Minor] Fix logging typo (#1615)  2024-10-08 21:11:19 -07:00
Lianmin Zheng  7b69d91b4f  Release v0.3.3 (#1605)  2024-10-08 12:58:41 -07:00
Byron Hsu  e8613df071  [Engine] Fix generate hanging issue after the first call (#1606)  2024-10-08 04:26:56 +00:00
Ying Sheng  c5325aba75  [Profile] Add pytorch profiler (#1604)  2024-10-07 14:37:16 -07:00
Lianmin Zheng  ebbc42d989  Optimize broadcast & Reorg code (#1598)  2024-10-07 13:19:23 -07:00
Jani Monoses  3ff641132e  Remove references to squeezellm (#1603)  2024-10-07 11:30:41 -07:00
Lianmin Zheng  2b302b9393  Fix the port_args in bench_latency (#1597)  2024-10-07 00:44:38 -07:00
Ke Bao  68f8b60d22  Fix chunked prefill condition (#1594)  2024-10-07 06:34:14 +00:00
Lianmin Zheng  6a5b352aaf  Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>)  2024-10-06 22:54:05 -07:00
Byron Hsu  565b05f02f  Use atexit hook to implicitly shutdown Runtime (#1595)  2024-10-07 05:18:45 +00:00
Lianmin Zheng  b6aad70ab1  [Fix] Fix the case where prompt_len = 0 (#1593)  2024-10-06 20:30:02 -07:00
Byron Hsu  551a3a9d38  Provide an offline engine API (#1567)  2024-10-06 20:27:03 -07:00
Lianmin Zheng  91877a9f9c  Fix modality for image inputs (#1592)  2024-10-06 15:43:32 -07:00
Ying Sheng  c98e84c21e  [Minor, Performance] Use torch.argmax for greedy sampling (#1589)  2024-10-06 13:15:05 -07:00
Ying Sheng  9c064bf78a  [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587)  2024-10-06 10:33:44 -07:00
Lianmin Zheng  58d1082e39  Clean up event loop (#1586)  2024-10-06 03:24:04 -07:00
HAI  4d086719e5  [Bug] Fix decode stats error on output_len 1 (#1585)  2024-10-06 08:09:09 +00:00
Lianmin Zheng  9244f27f0a  [Minor] Improve the style and fix flaky tests (#1584)  2024-10-06 00:10:48 -07:00
Byron Hsu  2422de5193  Support min_tokens in sgl.gen (#1573)  2024-10-05 21:51:12 -07:00
Byron Hsu  521f862d90  Fix runtime.generate when sampling param is not passed (#1582)  2024-10-05 17:59:05 -07:00
Byron Hsu  34c32d2820  Fix styling (#1583)  2024-10-05 17:52:14 -07:00
Byron Hsu  dde8bb16fe  default sampling param should be deepcopied (#1581)  2024-10-05 17:27:43 -07:00
Byron Hsu  8ac3ccc060  Backend method not found when SRT Runtime is used (#1576)  2024-10-05 11:47:35 -07:00
Jerry Zhang  9b0926ceeb  Add llama implementation with no tensor parallel linears (#1561)  2024-10-05 11:22:27 -07:00
Byron Hsu  6bfdb4031d  [Easy] use .text() instead of .text (#1577)  2024-10-05 11:07:41 -07:00