Commit Graph

742 Commits

Author SHA1 Message Date
Ke Bao
68f8b60d22 Fix chunked prefill condition (#1594) 2024-10-07 06:34:14 +00:00
Lianmin Zheng
6a5b352aaf Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>) 2024-10-06 22:54:05 -07:00
Byron Hsu
565b05f02f Use atexit hook to implicitly shutdown Runtime (#1595) 2024-10-07 05:18:45 +00:00
Lianmin Zheng
b6aad70ab1 [Fix] Fix the case where prompt_len = 0 (#1593) 2024-10-06 20:30:02 -07:00
Byron Hsu
551a3a9d38 Provide an offline engine API (#1567) 2024-10-06 20:27:03 -07:00
Lianmin Zheng
91877a9f9c Fix modality for image inputs (#1592) 2024-10-06 15:43:32 -07:00
Ying Sheng
c98e84c21e [Minor, Performance] Use torch.argmax for greedy sampling (#1589) 2024-10-06 13:15:05 -07:00
Ying Sheng
9c064bf78a [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) 2024-10-06 10:33:44 -07:00
Lianmin Zheng
58d1082e39 Clean up event loop (#1586) 2024-10-06 03:24:04 -07:00
HAI
4d086719e5 [Bug] Fix decode stats error on output_len 1 (#1585) 2024-10-06 08:09:09 +00:00
Lianmin Zheng
9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) 2024-10-06 00:10:48 -07:00
Byron Hsu
2422de5193 Support min_tokens in sgl.gen (#1573) 2024-10-05 21:51:12 -07:00
Byron Hsu
521f862d90 Fix runtime.generate when sampling param is not passed (#1582) 2024-10-05 17:59:05 -07:00
Byron Hsu
34c32d2820 Fix styling (#1583) 2024-10-05 17:52:14 -07:00
Byron Hsu
dde8bb16fe default sampling param should be deepcopied (#1581) 2024-10-05 17:27:43 -07:00
Byron Hsu
8ac3ccc060 Backend method not found when SRT Runtime is used (#1576) 2024-10-05 11:47:35 -07:00
Jerry Zhang
9b0926ceeb Add llama implementation with no tensor parallel linears (#1561) 2024-10-05 11:22:27 -07:00
Byron Hsu
6bfdb4031d [Easy] use .text() instead of .text (#1577) 2024-10-05 11:07:41 -07:00
Liangsheng Yin
5d0ba4038f Refine the add request reasons to avoid corner cases. (#1574) 2024-10-04 18:00:18 -07:00
Ying Sheng
04b262cd91 [Fix] Fix major performance bug in certain cases (#1563) (Co-authored-by: hnyls2002 <hnyls2002@gmail.com>) 2024-10-04 08:51:11 +00:00
Lianmin Zheng
45473d4b2b Make input_ids a torch.Tensor (#1568) 2024-10-04 01:09:59 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
HAI
e0b5dbcec1 [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale (#1559) 2024-10-03 01:52:26 -07:00
Minsang Song
e6852b0dd2 [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) 2024-10-02 20:41:15 -07:00
Lianmin Zheng
4ae0969c0a Move status check in the memory pool to CPU (#1557) 2024-10-02 18:23:35 -07:00
Lianmin Zheng
317631cada [Fix] Move ScheduleBatch out of SamplingInfo (#1556) 2024-10-02 17:18:04 -07:00
Lianmin Zheng
b564835364 [Fix] do not maintain regex_fsm in SamplingBatchInfo (#1555) 2024-10-02 13:19:44 -07:00
Theresa Barton
2c7d0a5b8b [Fix] Fix all the Huggingface paths (#1553) 2024-10-02 10:12:07 -07:00
kk
8cdc76f6d4 [Performance, Hardware] MoE tuning on AMD MI300x GPUs (#1554) (Co-authored-by: wunhuang <wunhuang@amd.com>) 2024-10-02 09:52:46 -07:00
Ying Sheng
f202ed9712 [Refactor] Simplify io_struct and tokenizer_manager (#1549) 2024-10-01 10:25:32 -07:00
Liangsheng Yin
100f5b8bc9 Simplify flashinfer dispatch (#1552) 2024-10-01 00:28:42 -07:00
Liangsheng Yin
619bb6ddda Dispatch flashinfer wrappers (#1550) 2024-09-30 23:12:36 -07:00
Liangsheng Yin
b88ea90d4a Fix bugs of logprobs_nums (#1548) 2024-09-30 17:09:54 -07:00
Liangsheng Yin
99ec439da4 Organize Attention Backends (#1547) 2024-09-30 15:54:18 -07:00
Ying Sheng
0f4fb19bc8 [Fix, LoRA] fix LoRA with updates in main (#1545) 2024-09-30 10:06:08 -07:00
Lianmin Zheng
63ba2f8d7b Clean up batch data structures: Introducing ModelWorkerBatch (#1544) 2024-09-30 06:41:49 -07:00
Lianmin Zheng
36d5acfca5 Rename InputMetadata -> ForwardBatch (#1543) 2024-09-30 02:41:11 -07:00
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Liangsheng Yin
55b974f96f Process image in parallel (#1539) 2024-09-29 18:52:43 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
Xinyu Yang
acaffd233f [Fix] fix ipv6 url when warm up model (#1537) 2024-09-29 11:02:40 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Liangsheng Yin
fd9ad817ec Organize image inputs (#1531) 2024-09-29 06:28:55 +00:00
Lianmin Zheng
e165a9fc1b Make detokenizer_manager.py not asyncio (#1532) 2024-09-28 19:33:09 -07:00
Lianmin Zheng
4e4459b91f Multiple minor fixes (#1530) 2024-09-28 14:43:35 -07:00
Jeffrey Fong
065bb94753 Fix RuntimeEndpoint.select method (#1495) 2024-09-28 14:04:06 -07:00
Kylin
f42e9bfb52 [bugfix] Add modelscope package to avoid docker image without modelscope (#1520) 2024-09-28 12:43:22 -07:00
Ninglin Du
840c5dbcb3 [FIX] Catch syntax error of Regex Guide to avoid crash (#1521) 2024-09-28 12:42:06 -07:00
Jerry Zhang
63e845d0bb Add float8 dynamic quant to torchao_utils (#1528) 2024-09-28 12:27:54 -07:00