Commit     | Message | Author | Date
-----------|---------|--------|-----
7ee6c259ff | Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652) | Lianmin Zheng | 2024-10-12 21:35:30 -07:00
9610fcd469 | Fix the batch_is_full check for jump-forward decoding (#1654) | Lianmin Zheng | 2024-10-12 19:47:24 -07:00
31fad29ab0 | Add get_tokenizer function for Engine class (#1653) | Patrick Yi | 2024-10-12 19:39:35 -07:00
9da5a60b18 | Add an option to disable penalizer (#1651) | Lianmin Zheng | 2024-10-12 17:53:23 -07:00
69aa937aa5 | Fix unit tests and type annotations (#1648) | Lianmin Zheng | 2024-10-12 14:49:24 -07:00
5d638c92f5 | [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) | Zhang, Liangang | 2024-10-12 18:10:32 +00:00
e37cdab0c6 | Fix ignore_eos (#1645) | Lianmin Zheng | 2024-10-12 00:36:28 -07:00
1d9deeacdb | fix missing ignore_eos in v1/chat/completions (#1642) | LI MOU | 2024-10-11 21:37:20 -07:00
862cd265e5 | [engine] support async and streaming (#1614) | Byron Hsu | 2024-10-11 15:26:25 -07:00
00c7e6368b | Release v0.3.3.post1 (#1636) | Lianmin Zheng | 2024-10-11 07:56:16 -07:00
23cc66f7b6 | Add back data parallelism (#1635) | Lianmin Zheng | 2024-10-11 07:22:48 -07:00
5d09ca5735 | Fix constrained decoding (#1634) | Lianmin Zheng | 2024-10-11 06:26:20 -07:00
f13d86f920 | Add image_token in conversation.py (#1632) (Co-authored-by: yizhang2077 <1109276519@qq.com>) | Lianmin Zheng | 2024-10-11 05:07:51 -07:00
aba9eae4c6 | Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py (#1631) | Lianmin Zheng | 2024-10-11 05:03:20 -07:00
bbd72bfc86 | Add the ability to enable and disable the Profiler via HTTP API. (#1626) | 科英 | 2024-10-11 02:34:25 -07:00
b503881bd2 | [Bug] Fix the Image Input of Batch Generation (#1579) | Yiding-Lu | 2024-10-11 02:25:04 -07:00
58093b868f | Nit about the decorator of PortArgs.init_new (#1611) | glen-amd | 2024-10-11 02:17:47 -07:00
8275049ce3 | Add device support (#1607) | Zhang, Liangang | 2024-10-11 02:05:58 -07:00
e11ab79e68 | [Performance, hardware] MoE tuning update to AMD MI300x GPUs (#1619) | HAI | 2024-10-10 22:48:15 -07:00
01fdb2f377 | Fix test_vision_openai_server on CI (#1620) | Byron Hsu | 2024-10-10 16:34:13 -07:00
c996e8ccd4 | [Minor] Fix logging typo (#1615) | Amos You | 2024-10-08 21:11:19 -07:00
7b69d91b4f | Release v0.3.3 (#1605) | Lianmin Zheng | 2024-10-08 12:58:41 -07:00
e8613df071 | [Engine] Fix generate hanging issue after the first call (#1606) | Byron Hsu | 2024-10-08 04:26:56 +00:00
c5325aba75 | [Profile] Add pytorch profiler (#1604) | Ying Sheng | 2024-10-07 14:37:16 -07:00
ebbc42d989 | Optimize broadcast & Reorg code (#1598) | Lianmin Zheng | 2024-10-07 13:19:23 -07:00
3ff641132e | Remove references to squeezellm (#1603) | Jani Monoses | 2024-10-07 11:30:41 -07:00
2b302b9393 | Fix the port_args in bench_latency (#1597) | Lianmin Zheng | 2024-10-07 00:44:38 -07:00
68f8b60d22 | Fix chunked prefill condition (#1594) | Ke Bao | 2024-10-07 06:34:14 +00:00
6a5b352aaf | Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>) | Lianmin Zheng | 2024-10-06 22:54:05 -07:00
565b05f02f | Use atexit hook to implicitly shutdown Runtime (#1595) | Byron Hsu | 2024-10-07 05:18:45 +00:00
b6aad70ab1 | [Fix] Fix the case where prompt_len = 0 (#1593) | Lianmin Zheng | 2024-10-06 20:30:02 -07:00
551a3a9d38 | Provide an offline engine API (#1567) | Byron Hsu | 2024-10-06 20:27:03 -07:00
91877a9f9c | Fix modality for image inputs (#1592) | Lianmin Zheng | 2024-10-06 15:43:32 -07:00
c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | Ying Sheng | 2024-10-06 13:15:05 -07:00
9c064bf78a | [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) | Ying Sheng | 2024-10-06 10:33:44 -07:00
58d1082e39 | Clean up event loop (#1586) | Lianmin Zheng | 2024-10-06 03:24:04 -07:00
4d086719e5 | [Bug] Fix decode stats error on output_len 1 (#1585) | HAI | 2024-10-06 08:09:09 +00:00
9244f27f0a | [Minor] Improve the style and fix flaky tests (#1584) | Lianmin Zheng | 2024-10-06 00:10:48 -07:00
2422de5193 | Support min_tokens in sgl.gen (#1573) | Byron Hsu | 2024-10-05 21:51:12 -07:00
521f862d90 | Fix runtime.generate when sampling param is not passed (#1582) | Byron Hsu | 2024-10-05 17:59:05 -07:00
34c32d2820 | Fix styling (#1583) | Byron Hsu | 2024-10-05 17:52:14 -07:00
dde8bb16fe | default sampling param should be deepcopied (#1581) | Byron Hsu | 2024-10-05 17:27:43 -07:00
8ac3ccc060 | Backend method not found when SRT Runtime is used (#1576) | Byron Hsu | 2024-10-05 11:47:35 -07:00
9b0926ceeb | Add llama implementation with no tensor parallel linears (#1561) | Jerry Zhang | 2024-10-05 11:22:27 -07:00
6bfdb4031d | [Easy] use .text() instead of .text (#1577) | Byron Hsu | 2024-10-05 11:07:41 -07:00
5d0ba4038f | Refine the add request reasons to avoid corner cases. (#1574) | Liangsheng Yin | 2024-10-04 18:00:18 -07:00
04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563) (Co-authored-by: hnyls2002 <hnyls2002@gmail.com>) | Ying Sheng | 2024-10-04 08:51:11 +00:00
45473d4b2b | Make input_ids a torch.Tensor (#1568) | Lianmin Zheng | 2024-10-04 01:09:59 -07:00
114bbc8651 | Use ipc instead of tcp in zmq (#1566) | Lianmin Zheng | 2024-10-04 00:45:52 -07:00
32eb6e96f2 | Organize sampling batch info better (#1562) | Lianmin Zheng | 2024-10-03 18:29:49 -07:00