Commit Graph

751 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| HAI | e11ab79e68 | [Performance, hardware] MoE tuning update to AMD MI300x GPUs (#1619) | 2024-10-10 22:48:15 -07:00 |
| Byron Hsu | 01fdb2f377 | Fix test_vision_openai_server on CI (#1620) | 2024-10-10 16:34:13 -07:00 |
| Amos You | c996e8ccd4 | [Minor] Fix logging typo (#1615) | 2024-10-08 21:11:19 -07:00 |
| Lianmin Zheng | 7b69d91b4f | Release v0.3.3 (#1605) | 2024-10-08 12:58:41 -07:00 |
| Byron Hsu | e8613df071 | [Engine] Fix generate hanging issue after the first call (#1606) | 2024-10-08 04:26:56 +00:00 |
| Ying Sheng | c5325aba75 | [Profile] Add pytorch profiler (#1604) | 2024-10-07 14:37:16 -07:00 |
| Lianmin Zheng | ebbc42d989 | Optimize broadcast & Reorg code (#1598) | 2024-10-07 13:19:23 -07:00 |
| Jani Monoses | 3ff641132e | Remove references to squeezellm (#1603) | 2024-10-07 11:30:41 -07:00 |
| Lianmin Zheng | 2b302b9393 | Fix the port_args in bench_latency (#1597) | 2024-10-07 00:44:38 -07:00 |
| Ke Bao | 68f8b60d22 | Fix chunked prefill condition (#1594) | 2024-10-07 06:34:14 +00:00 |
| Lianmin Zheng | 6a5b352aaf | Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (co-authored-by: Zhang Liangang <liangang.zhang@intel.com>) | 2024-10-06 22:54:05 -07:00 |
| Byron Hsu | 565b05f02f | Use atexit hook to implicitly shutdown Runtime (#1595) | 2024-10-07 05:18:45 +00:00 |
| Lianmin Zheng | b6aad70ab1 | [Fix] Fix the case where prompt_len = 0 (#1593) | 2024-10-06 20:30:02 -07:00 |
| Byron Hsu | 551a3a9d38 | Provide an offline engine API (#1567) | 2024-10-06 20:27:03 -07:00 |
| Lianmin Zheng | 91877a9f9c | Fix modality for image inputs (#1592) | 2024-10-06 15:43:32 -07:00 |
| Ying Sheng | c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | 2024-10-06 13:15:05 -07:00 |
| Ying Sheng | 9c064bf78a | [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) | 2024-10-06 10:33:44 -07:00 |
| Lianmin Zheng | 58d1082e39 | Clean up event loop (#1586) | 2024-10-06 03:24:04 -07:00 |
| HAI | 4d086719e5 | [Bug] Fix decode stats error on output_len 1 (#1585) | 2024-10-06 08:09:09 +00:00 |
| Lianmin Zheng | 9244f27f0a | [Minor] Improve the style and fix flaky tests (#1584) | 2024-10-06 00:10:48 -07:00 |
| Byron Hsu | 2422de5193 | Support min_tokens in sgl.gen (#1573) | 2024-10-05 21:51:12 -07:00 |
| Byron Hsu | 521f862d90 | Fix runtime.generate when sampling param is not passed (#1582) | 2024-10-05 17:59:05 -07:00 |
| Byron Hsu | 34c32d2820 | Fix styling (#1583) | 2024-10-05 17:52:14 -07:00 |
| Byron Hsu | dde8bb16fe | default sampling param should be deepcopied (#1581) | 2024-10-05 17:27:43 -07:00 |
| Byron Hsu | 8ac3ccc060 | Backend method not found when SRT Runtime is used (#1576) | 2024-10-05 11:47:35 -07:00 |
| Jerry Zhang | 9b0926ceeb | Add llama implementation with no tensor parallel linears (#1561) | 2024-10-05 11:22:27 -07:00 |
| Byron Hsu | 6bfdb4031d | [Easy] use .text() instead of .text (#1577) | 2024-10-05 11:07:41 -07:00 |
| Liangsheng Yin | 5d0ba4038f | Refine the add request reasons to avoid corner cases. (#1574) | 2024-10-04 18:00:18 -07:00 |
| Ying Sheng | 04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563) (co-authored-by: hnyls2002 <hnyls2002@gmail.com>) | 2024-10-04 08:51:11 +00:00 |
| Lianmin Zheng | 45473d4b2b | Make input_ids a torch.Tensor (#1568) | 2024-10-04 01:09:59 -07:00 |
| Lianmin Zheng | 114bbc8651 | Use ipc instead of tcp in zmq (#1566) | 2024-10-04 00:45:52 -07:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| HAI | e0b5dbcec1 | [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale (#1559) | 2024-10-03 01:52:26 -07:00 |
| Minsang Song | e6852b0dd2 | [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536) (co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-10-02 20:41:15 -07:00 |
| Lianmin Zheng | 4ae0969c0a | Move status check in the memory pool to CPU (#1557) | 2024-10-02 18:23:35 -07:00 |
| Lianmin Zheng | 317631cada | [Fix] Move ScheduleBatch out of SamplingInfo (#1556) | 2024-10-02 17:18:04 -07:00 |
| Lianmin Zheng | b564835364 | [Fix] do not maintain regex_fsm in SamplingBatchInfo (#1555) | 2024-10-02 13:19:44 -07:00 |
| Theresa Barton | 2c7d0a5b8b | [Fix] Fix all the Huggingface paths (#1553) | 2024-10-02 10:12:07 -07:00 |
| kk | 8cdc76f6d4 | [Performance, Hardware] MoE tuning on AMD MI300x GPUs (#1554) (co-authored-by: wunhuang <wunhuang@amd.com>) | 2024-10-02 09:52:46 -07:00 |
| Ying Sheng | f202ed9712 | [Refactor] Simplify io_struct and tokenizer_manager (#1549) | 2024-10-01 10:25:32 -07:00 |
| Liangsheng Yin | 100f5b8bc9 | Simplify flashinfer dispatch (#1552) | 2024-10-01 00:28:42 -07:00 |
| Liangsheng Yin | 619bb6ddda | Dispatch flashinfer wrappers (#1550) | 2024-09-30 23:12:36 -07:00 |
| Liangsheng Yin | b88ea90d4a | Fix bugs of logprobs_nums (#1548) | 2024-09-30 17:09:54 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Ying Sheng | 0f4fb19bc8 | [Fix, LoRA] fix LoRA with updates in main (#1545) | 2024-09-30 10:06:08 -07:00 |
| Lianmin Zheng | 63ba2f8d7b | Clean up batch data structures: Introducing ModelWorkerBatch (#1544) | 2024-09-30 06:41:49 -07:00 |
| Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Liangsheng Yin | 55b974f96f | Process image in parallel (#1539) | 2024-09-29 18:52:43 -07:00 |
| Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00 |