Commit Graph

  • 087257ea03 Release v0.3.4 (#1714) Lianmin Zheng 2024-10-19 08:17:41 -07:00
  • 736f04025d Update README.md (#1713) Lianmin Zheng 2024-10-19 07:11:02 -07:00
  • 769bf11c05 Fix the race condition in overlap mode (#1712) Lianmin Zheng 2024-10-19 06:50:56 -07:00
  • 3db43d1b08 Fix is_all_ready for overlap copy (#1710) Lianmin Zheng 2024-10-18 21:01:52 -07:00
  • f0f8a7699b Simplify the nan detection and greedy check in sampler (#1709) Lianmin Zheng 2024-10-18 20:21:24 -07:00
  • 2bcfba1b08 Skip unnecessary penalizer (#1707) Lianmin Zheng 2024-10-18 17:54:03 -07:00
  • bc12d4033f Add grouped free operations (#1706) Lianmin Zheng 2024-10-18 13:21:05 -07:00
  • 392f2863c8 Add dtype for more operations (#1705) Lianmin Zheng 2024-10-18 12:18:15 -07:00
  • 6d0fa73ece Simplify flashinfer utilities (#1704) Lianmin Zheng 2024-10-17 22:54:14 -07:00
  • 9e0dac1ad7 Fix regex and logprob conflicts when chunked prefilling (#1703) Liangsheng Yin 2024-10-17 18:33:21 -07:00
  • a95d5589c3 Add matched_stop token or str to distinguish between eos or stop str finish_reason generation (#1684) Gleb Drozdov 2024-10-17 22:06:52 +04:00
  • d17d19e5b8 Fix mixed batch for multi modal models (#1702) Lianmin Zheng 2024-10-17 10:27:26 -07:00
  • dd3809fad8 Fix engine unit test (#1701) Lianmin Zheng 2024-10-17 09:53:32 -07:00
  • 7feba41584 Fix failed ci tests on long prompts; Better error messages for embedding models (#1700) Lianmin Zheng 2024-10-17 09:23:29 -07:00
  • 30ee36305e Fix the failed unit tests (#1699) Lianmin Zheng 2024-10-17 08:13:29 -07:00
  • e5db40dcbc ORJson. Faster Json serialization (#1694) Michael Feil 2024-10-17 08:03:08 -07:00
  • b170930534 feat: radix tree code optimize (#1697) wxsm 2024-10-17 23:01:27 +08:00
  • 5ab20cceba Use SGLang imports for linear layer (#1696) Jani Monoses 2024-10-17 17:50:01 +03:00
  • 02f7f3e488 Update the transformers version in CI (#1690) Lianmin Zheng 2024-10-16 19:03:55 -07:00
  • 2782132be8 Add date to logging messages (#1623) (#1679) Zeng Zhongchao 2024-10-17 09:54:55 +08:00
  • d19cc0b9c9 Update README.md (#1689) Lianmin Zheng 2024-10-16 18:36:24 -07:00
  • b0facb3316 add orjson for jsonresponse (#1688) Michael Feil 2024-10-16 18:14:30 -07:00
  • ecb8bad276 Returning a per request metric for number of cached_tokens read (#1599) havetc 2024-10-16 20:49:22 +02:00
  • dbec2f1847 Launch a thread to overlap CPU and GPU (#1687) Lianmin Zheng 2024-10-16 11:20:17 -07:00
  • e4b367baa8 [Event] Add online meetup meeting link (#1686) Ying Sheng 2024-10-16 10:58:14 -07:00
  • d10b933a36 Fix srt dependency (#1685) Ke Bao 2024-10-16 23:21:20 +08:00
  • 9116b2896f Add a new event loop (#1677) Lianmin Zheng 2024-10-16 01:33:20 -07:00
  • a5114b6f91 Add OLMo model (#1676) Jani Monoses 2024-10-16 10:11:18 +03:00
  • b6b4094621 Fix filter_batch function call (#1681) Liangsheng Yin 2024-10-15 22:59:26 -07:00
  • f1088e0fc8 Fix memory leak during abort (#1674) Lianmin Zheng 2024-10-15 08:15:08 -07:00
  • 175afed370 Improve benchmark scripts (#1672) Lianmin Zheng 2024-10-14 21:53:01 -07:00
  • 4a292f670d [Minor] Add some utility functions (#1671) Lianmin Zheng 2024-10-14 20:08:03 -07:00
  • cd0be7489f [doc] improve engine doc and add to readme (#1670) Byron Hsu 2024-10-14 19:56:21 -07:00
  • 56503d9bc9 [1/N] Remove CacheConfig import in all model files (#1658) Byron Hsu 2024-10-14 09:06:34 -07:00
  • 02bc95796d Simplify chunked prefill (#1667) Lianmin Zheng 2024-10-14 06:47:50 -07:00
  • 24f3e1511c [Minor] Improve style (#1666) Lianmin Zheng 2024-10-14 05:25:00 -07:00
  • 6790240cc3 Fix unit test order to balance the tasks in CI (#1665) Lianmin Zheng 2024-10-14 02:01:44 -07:00
  • 061e546313 Support double sparsity (#1459) Shuo Yang 2024-10-14 02:00:41 -07:00
  • 0c1e87964b Move filter_batch out of stream_output (#1663) Lianmin Zheng 2024-10-14 01:15:34 -07:00
  • 869f1c02c4 Add a test case to test retract (#1662) Lianmin Zheng 2024-10-13 20:32:37 -07:00
  • 2725f8da61 [Minor] Rename no_eos_trim to no_stop_trim (#1661) Ying Sheng 2024-10-13 20:30:03 -07:00
  • da1ffed689 Add output_ids into ScheduleBatch (#1659) Lianmin Zheng 2024-10-13 19:54:02 -07:00
  • 4876117171 [Fix] fix eos trim inconsistency (#1650) Ying Sheng 2024-10-13 01:07:09 -07:00
  • c3f2fc5a7a [doc] Add engine section in backend.md (#1656) Byron Hsu 2024-10-13 00:33:58 -07:00
  • 7ee6c259ff Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652) Lianmin Zheng 2024-10-12 21:35:30 -07:00
  • 9610fcd469 Fix the batch_is_full check for jump-forward decoding (#1654) Lianmin Zheng 2024-10-12 19:47:24 -07:00
  • 31fad29ab0 Add get_tokenizer function for Engine class (#1653) Patrick Yi 2024-10-12 22:39:35 -04:00
  • 9da5a60b18 Add an option to disable penalizer (#1651) Lianmin Zheng 2024-10-12 17:53:23 -07:00
  • 69aa937aa5 Fix unit tests and type annotations (#1648) Lianmin Zheng 2024-10-12 14:49:24 -07:00
  • 5d638c92f5 [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) Zhang, Liangang 2024-10-13 02:10:32 +08:00
  • e37cdab0c6 Fix ignore_eos (#1645) Lianmin Zheng 2024-10-12 00:36:28 -07:00
  • 1d9deeacdb fix missing ignore_eos in v1/chat/completions (#1642) LI MOU 2024-10-12 12:37:20 +08:00
  • dafb6a5266 [Fix] Fix the style of test_large_max_new_tokens.py (#1638) Lianmin Zheng 2024-10-11 16:05:58 -07:00
  • 862cd265e5 [engine] support async and streaming (#1614) Byron Hsu 2024-10-11 15:26:25 -07:00
  • 00c7e6368b Release v0.3.3.post1 (#1636) Lianmin Zheng 2024-10-11 07:56:16 -07:00
  • 23cc66f7b6 Add back data parallelism (#1635) Lianmin Zheng 2024-10-11 07:22:48 -07:00
  • 5d09ca5735 Fix constrained decoding (#1634) Lianmin Zheng 2024-10-11 06:26:20 -07:00
  • 81c3327402 Added a "Back To Top" Button (#1633) Janumala Akhilendra 2024-10-11 18:55:30 +05:30
  • f13d86f920 Add image_token in conversation.py (#1632) Lianmin Zheng 2024-10-11 05:07:51 -07:00
  • aba9eae4c6 Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py (#1631) Lianmin Zheng 2024-10-11 05:03:20 -07:00
  • bbd72bfc86 Add the ability to enable and disable the Profiler via HTTP API. (#1626) 科英 2024-10-11 17:34:25 +08:00
  • b503881bd2 [Bug] Fix the Image Input of Batch Generation (#1579) Yiding-Lu 2024-10-11 17:25:04 +08:00
  • 58093b868f Nit about the decorator of PortArgs.init_new (#1611) glen-amd 2024-10-11 02:17:47 -07:00
  • 8275049ce3 Add device support (#1607) Zhang, Liangang 2024-10-11 17:05:58 +08:00
  • 5476ccad8f Update README.md Lianmin Zheng 2024-10-11 01:59:49 -07:00
  • b040ed71f7 Update README.md (#1629) Lianmin Zheng 2024-10-11 01:58:25 -07:00
  • c9e6658699 Update README.md (#1625) Kushal Agrawal 2024-10-11 14:27:42 +05:30
  • e11ab79e68 [Performance, hardware] MoE tuning update to AMD MI300x GPUs (#1619) HAI 2024-10-10 22:48:15 -07:00
  • 01fdb2f377 Fix test_vision_openai_server on CI (#1620) Byron Hsu 2024-10-10 16:34:13 -07:00
  • c996e8ccd4 [Minor] Fix logging typo (#1615) Amos You 2024-10-08 21:11:19 -07:00
  • 7b69d91b4f Release v0.3.3 (#1605) Lianmin Zheng 2024-10-08 12:58:41 -07:00
  • e8613df071 [Engine] Fix generate hanging issue after the first call (#1606) Byron Hsu 2024-10-07 21:26:56 -07:00
  • c5325aba75 [Profile] Add pytorch profiler (#1604) Ying Sheng 2024-10-07 14:37:16 -07:00
  • ebbc42d989 Optimize broadcast & Reorg code (#1598) Lianmin Zheng 2024-10-07 13:05:53 -07:00
  • 3ff641132e Remove references to squeezellm (#1603) Jani Monoses 2024-10-07 21:30:41 +03:00
  • 2b302b9393 Fix the port_args in bench_latency (#1597) Lianmin Zheng 2024-10-07 00:44:38 -07:00
  • 68f8b60d22 Fix chunked prefill condition (#1594) Ke Bao 2024-10-07 14:34:14 +08:00
  • 6a5b352aaf Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) Lianmin Zheng 2024-10-06 22:54:05 -07:00
  • 565b05f02f Use atexit hook to implicitly shutdown Runtime (#1595) Byron Hsu 2024-10-06 22:18:45 -07:00
  • b6aad70ab1 [Fix] Fix the case where prompt_len = 0 (#1593) Lianmin Zheng 2024-10-06 20:30:02 -07:00
  • 551a3a9d38 Provide an offline engine API (#1567) Byron Hsu 2024-10-06 20:27:03 -07:00
  • 91877a9f9c Fix modality for image inputs (#1592) Lianmin Zheng 2024-10-06 15:43:32 -07:00
  • f7cce751f9 Update README.md (#1591) Lianmin Zheng 2024-10-06 15:14:29 -07:00
  • 17e998f1a8 Test consistency for single and batch seperately (#1590) Byron Hsu 2024-10-06 15:02:27 -07:00
  • c98e84c21e [Minor, Performance] Use torch.argmax for greedy sampling (#1589) Ying Sheng 2024-10-06 13:15:05 -07:00
  • 9c064bf78a [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) Ying Sheng 2024-10-06 10:33:44 -07:00
  • 58d1082e39 Clean up event loop (#1586) Lianmin Zheng 2024-10-06 03:24:04 -07:00
  • 4d086719e5 [Bug] Fix decode stats error on output_len 1 (#1585) HAI 2024-10-06 01:09:09 -07:00
  • 9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) Lianmin Zheng 2024-10-06 00:10:48 -07:00
  • 2422de5193 Support min_tokens in sgl.gen (#1573) Byron Hsu 2024-10-05 21:51:12 -07:00
  • 521f862d90 Fix runtime.generate when sampling param is not passed (#1582) Byron Hsu 2024-10-05 17:59:05 -07:00
  • 34c32d2820 Fix styling (#1583) Byron Hsu 2024-10-05 17:52:14 -07:00
  • dde8bb16fe default sampling param should be deepcopied (#1581) Byron Hsu 2024-10-05 17:27:43 -07:00
  • 8ac3ccc060 Backend method not found when SRT Runtime is used (#1576) Byron Hsu 2024-10-05 11:47:35 -07:00
  • 9b0926ceeb Add llama implementation with no tensor parallel linears (#1561) Jerry Zhang 2024-10-05 11:22:27 -07:00
  • 1c1bdc7699 [Event] Update README.md (#1572) Ying Sheng 2024-10-05 11:16:47 -07:00
  • 6bfdb4031d [Easy] use .text() instead of .text (#1577) Byron Hsu 2024-10-05 11:07:41 -07:00
  • f8fb4ce9b0 chore: update README.md (#1580) Ikko Eltociear Ashimine 2024-10-06 03:05:57 +09:00
  • 5d0ba4038f Refine the add request reasons to avoid corner cases. (#1574) Liangsheng Yin 2024-10-04 18:00:18 -07:00
  • 04b262cd91 [Fix] Fix major performance bug in certain cases (#1563) Ying Sheng 2024-10-04 01:51:11 -07:00