# Commit Graph

1545 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Liangsheng Yin | e7ebecf82e | Fix cache hit rate when chunked prefill (#2555) | 2024-12-26 03:14:28 -08:00 |
| Xiaoyu Zhang | 9a23c48456 | h100 tuning fused_moe_triton for qwen2 moe (#2560) | 2024-12-26 03:13:31 -08:00 |
| Yineng Zhang | 635a042623 | docs: update deepseek v3 example (#2592) | 2024-12-26 17:43:37 +08:00 |
| Yineng Zhang | 2dccecf432 | fix: only enable moe_align_block_size for now (#2590) | 2024-12-26 16:56:59 +08:00 |
| Yineng Zhang | 75ad0a143f | docs: add deepseek v3 launch instructions (#2589) | 2024-12-25 23:26:54 -08:00 |
| Yineng Zhang | efc52f85e2 | chore: bump v0.4.1 (#2582) | 2024-12-26 07:14:51 +08:00 |
| Yineng Zhang | 60e2fdcf4f | use sgl-kernel moe_align_block_size (#2581); Co-authored-by: ispobock (ispobaoke@163.com), HandH1998 (1335248067@qq.com) | 2024-12-26 06:29:08 +08:00 |
| Yineng Zhang | d7c0e872b0 | chore: bump 0.0.2.post8 for sgl-kernel (#2580) | 2024-12-26 06:11:39 +08:00 |
| Yineng Zhang | 31548116a8 | fix moe_align_block_size_kernel for shared memory issue (#2579); Co-authored-by: ispobock (ispobaoke@163.com) | 2024-12-26 05:31:04 +08:00 |
| HandH1998 | 53aed988cb | Refactor MoE (#2575); Co-authored-by: zhyncs (me@zhyncs.com) | 2024-12-26 00:02:14 +08:00 |
| Ying Sheng | 8a56b43175 | [Bench] Flush cache before benchmarking (#2566) | 2024-12-24 11:21:21 +08:00 |
| Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00 |
| Lianmin Zheng | 23e5e50fd5 | Fix gemlite import (#2553) | 2024-12-22 20:21:17 -08:00 |
| Shi Shuai | 25e5d589e3 | Doc: Update Grammar Backend (#2545); Co-authored-by: Chayenne (zhaochen20@outlook.com) | 2024-12-22 17:14:40 -08:00 |
| Lianmin Zheng | 41b1db69b8 | A better aio rwlock that guarantees the order (#2547) | 2024-12-22 15:44:32 -08:00 |
| Lianmin Zheng | 8496701934 | [Misc] Fix metrics, weight update lock, request logging (#2543) | 2024-12-22 06:27:22 -08:00 |
| Xiaoyu Zhang | 7d672d277b | [kernel optimize] benchmark write_req_to_token_pool_triton and optimize kernel (#2509) | 2024-12-22 02:31:02 -08:00 |
| Yineng Zhang | d4b174817d | docs: update sponsorship (DataCrunch) (#2523) | 2024-12-22 02:29:04 -08:00 |
| Lei | 19ba2b0ea9 | Add lora_paths to v1_chat_generate_request (#2529) | 2024-12-22 02:23:33 -08:00 |
| Yineng Zhang | 4e1e3cff20 | fix #2528 (#2541) | 2024-12-22 00:14:41 +08:00 |
| Yineng Zhang | 8f4d04e540 | chore: bump v0.4.0.post2 (#2525) | 2024-12-21 21:16:34 +08:00 |
| Jerry Zhang | feb2b768ba | Add integration with gemlite weight only quant (#2528) | 2024-12-21 00:25:25 +08:00 |
| Yineng Zhang | d95a5f5bf5 | fix followup #2517 (#2524) | 2024-12-19 23:24:30 +08:00 |
| Yineng Zhang | 4b83db24f1 | fix: continue to use flashinfer 0.1.6 temporarily (#2517) | 2024-12-19 14:03:24 +08:00 |
| Yineng Zhang | 64456cf023 | docs: update README (#2516) | 2024-12-19 13:44:02 +08:00 |
| Yineng Zhang | bb4a922023 | feat: add llama3 eval (#2515) | 2024-12-19 13:37:09 +08:00 |
| Lianmin Zheng | 21e9e63ad5 | Print progress bar during cuda graph capture (#2502) | 2024-12-17 06:33:46 -08:00 |
| Lianmin Zheng | 1fc84cf60b | Update readme (#2500); Co-authored-by: Ravi Theja (ravi03071991@gmail.com), "yixin-huang1" (yixinhuang1@berkeley.edu) | 2024-12-17 04:33:36 -08:00 |
| Lianmin Zheng | 361ea8d912 | Fix openai protocols and pass top_k, min_p (#2499) | 2024-12-17 04:14:14 -08:00 |
| Lei | 33c5ff2845 | Add lora_path to chat completion (#2438) | 2024-12-17 03:47:49 -08:00 |
| Hui Liu | 5ce9daea59 | ROCm support for sglang.check_env (#2426) | 2024-12-17 03:45:14 -08:00 |
| Ata Fatahi | ce094a5d79 | Clean up GPU memory after killing sglang processes (#2457); Signed-off-by: Ata Fatahi (immrata@gmail.com) | 2024-12-17 03:42:40 -08:00 |
| bjmsong | e21026690d | benchmark decoding attention kernel with cudnn (#2467); Co-authored-by: root (bjmsong@126.com) | 2024-12-17 03:31:57 -08:00 |
| Lianmin Zheng | bd6196163e | Small fix for the order of apply_torchao_config (#2495) | 2024-12-16 19:21:11 -08:00 |
| Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
| Lianmin Zheng | ba36b5520a | Revert "Small fixes for torchao quant" (#2493) | 2024-12-16 15:04:16 -08:00 |
| Lianmin Zheng | 9cd9dc83b3 | Temporarily disable unit test of torch native attention backend (#2492) | 2024-12-16 14:17:27 -08:00 |
| Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
| Jerry Zhang | 82699474fd | Small fixes for torchao quant (#2476) | 2024-12-16 14:08:12 -08:00 |
| Yineng Zhang | 7154b4b1df | minor: update flashinfer nightly (#2490) | 2024-12-16 23:02:49 +08:00 |
| xiaobochen | b532a5fd16 | fix moe-ep accuracy issue for fp8 (#2489) | 2024-12-16 20:54:02 +08:00 |
| Xiaoyu Zhang | a0592c059f | [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) | 2024-12-15 13:52:08 +08:00 |
| Yineng Zhang | e8dbdf75bc | fix typo (#2487) | 2024-12-15 13:44:55 +08:00 |
| yizhang2077 | e04d3f2897 | adapt tensorrt llm custom all reduce to sgl-kernel (#2481); Co-authored-by: Yineng Zhang (me@zhyncs.com) | 2024-12-15 13:15:59 +08:00 |
| Yineng Zhang | 5f2595be43 | hotfix: checking for HIP (#2485) | 2024-12-15 02:47:26 +08:00 |
| Ke Bao | 0ba2c58947 | Remove cuda graph batch size adjustment for dp attention (#2484) | 2024-12-14 23:53:54 +08:00 |
| Yineng Zhang | fccbfa3752 | format: add clang-format for sgl-kernel (#2483) | 2024-12-14 22:36:04 +08:00 |
| Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00 |
| Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
| Yineng Zhang | f0ed9c353e | feat: support dev image (#2469) | 2024-12-13 02:23:52 +08:00 |