Commit Graph

723 Commits

Author SHA1 Message Date
Ying Sheng
04b262cd91 [Fix] Fix major performance bug in certain cases (#1563) 2024-10-04 08:51:11 +00:00
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
Lianmin Zheng
45473d4b2b Make input_ids a torch.Tensor (#1568) 2024-10-04 01:09:59 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Lianmin Zheng
32eb6e96f2 Organize sampling batch info better (#1562) 2024-10-03 18:29:49 -07:00
HAI
e0b5dbcec1 [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale (#1559) 2024-10-03 01:52:26 -07:00
Minsang Song
e6852b0dd2 [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536) 2024-10-02 20:41:15 -07:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Lianmin Zheng
4ae0969c0a Move status check in the memory pool to CPU (#1557) 2024-10-02 18:23:35 -07:00
Lianmin Zheng
317631cada [Fix] Move ScheduleBatch out of SamplingInfo (#1556) 2024-10-02 17:18:04 -07:00
Lianmin Zheng
b564835364 [Fix] do not maintain regex_fsm in SamplingBatchInfo (#1555) 2024-10-02 13:19:44 -07:00
Theresa Barton
2c7d0a5b8b [Fix] Fix all the Huggingface paths (#1553) 2024-10-02 10:12:07 -07:00
kk
8cdc76f6d4 [Performance, Hardware] MoE tuning on AMD MI300x GPUs (#1554) 2024-10-02 09:52:46 -07:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Ying Sheng
f202ed9712 [Refactor] Simplify io_struct and tokenizer_manager (#1549) 2024-10-01 10:25:32 -07:00
Liangsheng Yin
100f5b8bc9 Simplify flashinfer dispatch (#1552) 2024-10-01 00:28:42 -07:00
Liangsheng Yin
619bb6ddda Dispatch flashinfer wrappers (#1550) 2024-09-30 23:12:36 -07:00
Liangsheng Yin
b88ea90d4a Fix bugs of logprobs_nums (#1548) 2024-09-30 17:09:54 -07:00
Liangsheng Yin
99ec439da4 Organize Attention Backends (#1547) 2024-09-30 15:54:18 -07:00
Ying Sheng
0f4fb19bc8 [Fix, LoRA] fix LoRA with updates in main (#1545) 2024-09-30 10:06:08 -07:00
Lianmin Zheng
63ba2f8d7b Clean up batch data structures: Introducing ModelWorkerBatch (#1544) 2024-09-30 06:41:49 -07:00
Lianmin Zheng
36d5acfca5 Rename InputMetadata -> ForwardBatch (#1543) 2024-09-30 02:41:11 -07:00
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Liangsheng Yin
55b974f96f Process image in parallel (#1539) 2024-09-29 18:52:43 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
Xinyu Yang
acaffd233f [Fix] fix ipv6 url when warm up model (#1537) 2024-09-29 11:02:40 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Liangsheng Yin
fd9ad817ec Organize image inputs (#1531) 2024-09-29 06:28:55 +00:00
Lianmin Zheng
e165a9fc1b Make detokenizer_manager.py not asyncio (#1532) 2024-09-28 19:33:09 -07:00
Lianmin Zheng
4e4459b91f Multiple minor fixes (#1530) 2024-09-28 14:43:35 -07:00
Jeffrey Fong
065bb94753 Fix RuntimeEndpoint.select method (#1495) 2024-09-28 14:04:06 -07:00
Kylin
f42e9bfb52 [bugfix] Add modelscope package to avoid docker image without modelscope (#1520) 2024-09-28 12:43:22 -07:00
Ninglin Du
840c5dbcb3 [FIX] Catch syntax error of Regex Guide to avoid crash (#1521) 2024-09-28 12:42:06 -07:00
Jerry Zhang
63e845d0bb Add float8 dynamic quant to torchao_utils (#1528) 2024-09-28 12:27:54 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
Liangsheng Yin
4353acb469 minor: fix config (#1524) 2024-09-27 01:49:16 -07:00
Lianmin Zheng
9ae1db0bdc [Fix] Ignore import error (#1513) 2024-09-25 11:32:21 -07:00
Ying Sheng
37c5899fc2 Release v0.3.2 (#1512) 2024-09-25 14:17:09 +08:00
Ying Sheng
f39a0197fd Revert "kernel: use tensor cores for flashinfer gqa kernels" (#1511) 2024-09-24 22:50:31 -07:00
TianyiQ
3c93187caf Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) 2024-09-24 21:50:20 -07:00
Lianmin Zheng
fb2d0680e0 [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) 2024-09-24 21:37:33 -07:00
Lianmin Zheng
067d8e16fc Simplify bench_latency.py (#1503) 2024-09-24 17:42:07 -07:00
luzengxiangcn
e6692bf4a5 debug radixcache stack_overflow (#1499) 2024-09-24 04:58:01 -07:00
Ke Bao
8d4ed42ad5 MoE torch compile (#1497) 2024-09-24 01:46:59 -07:00
Lianmin Zheng
2854a5ea9f Fix the overhead due to penalizer in bench_latency (#1496) 2024-09-23 07:38:14 -07:00
Yineng Zhang
42a2d82ba7 minor: add mla fp8 test (#1494) 2024-09-23 20:40:17 +08:00
Ying Sheng
e4780cf839 [API, Feature] Support response prefill for openai API (#1490) 2024-09-22 06:46:17 -07:00
Lianmin Zheng
39bb49d156 Update dockerfile to include datamodel_code_generator (#1492) 2024-09-22 04:49:16 -07:00
wellhowtosay
2a99993cd9 Pr fix max workers (#1456) 2024-09-22 02:20:26 -07:00
Co-authored-by: baolujia <baolujia@shizhuang-inc.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Lianmin Zheng
167591e864 Better unit tests for adding a new model (#1488) 2024-09-22 01:50:37 -07:00
Yineng Zhang
82136eb0b5 chore: bump v0.3.1.post3 (#1483) 2024-09-21 11:17:45 +08:00
Ke Bao
b8ccaf4d73 Add MLA gsm8k eval (#1484) 2024-09-21 11:16:13 +08:00
Ke Bao
a68cb201dd Fix triton head num (#1482) 2024-09-21 10:25:20 +08:00