sglang

Author	SHA1	Message	Date
Ke Bao	c23d5706f4	Update whl index path (#3128)	2025-01-25 23:57:09 +08:00
Ke Bao	665e5e85f6	Add step to update sgl-kernel whl index (#3110)	2025-01-25 02:03:01 +08:00
Byron Hsu	9a0cc2e90e	[router] Forward all request headers from router to workers (#3070)	2025-01-23 20:30:31 -08:00
Lianmin Zheng	61f42b5732	Move sgl.Runtime under sglang/lang (#2990)	2025-01-19 17:10:29 -08:00
Byron Hsu	ef18b0eda2	[router] Allow empty worker list for sglang.launch_router (#2979)	2025-01-19 01:05:23 -08:00
Yineng Zhang	d06c1ab587	update ci install dependency (#2949)	2025-01-17 23:42:23 +08:00
Lianmin Zheng	f65c13b559	Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902)	2025-01-15 04:54:14 -08:00
fzyzcjy	923f518337	CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)	2025-01-13 11:38:51 -08:00
Lianmin Zheng	8a6906127a	Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2025-01-07 23:29:10 -08:00
Yineng Zhang	bc6ad367c2	fix lint (#2733)	2025-01-05 14:45:42 +08:00
Ce Gao	f5d0865b25	feat: Support VLM in reference_hf (#2726) Signed-off-by: Ce Gao <gaocegege@hotmail.com>	2025-01-03 22:32:30 +08:00
Yineng Zhang	d49b13c6f8	feat: use CUDA 12.4 by default (for FA3) (#2682)	2024-12-31 15:52:09 +08:00
fzyzcjy	f707470019	CI: Update scripts to fail fast (#2672)	2024-12-30 19:04:01 -08:00
Yineng Zhang	d95a5f5bf5	fix followup #2517 (#2524)	2024-12-19 23:24:30 +08:00
Ata Fatahi	ce094a5d79	Clean up GPU memory after killing sglang processes (#2457) Signed-off-by: Ata Fatahi <immrata@gmail.com>	2024-12-17 03:42:40 -08:00
Yineng Zhang	7154b4b1df	minor: update flashinfer nightly (#2490)	2024-12-16 23:02:49 +08:00
Lianmin Zheng	835f8afc77	Migrate llama_classification to use the /classify interface (#2417)	2024-12-08 23:30:51 -08:00
Lianmin Zheng	96db0f666d	Update killall_sglang.sh (#2397)	2024-12-08 01:56:26 -08:00
Yineng Zhang	75ae968959	minor: update killall script (#2391)	2024-12-08 04:21:00 +08:00
Yineng Zhang	3dbd73d319	minor: rm unused _grouped_size_compiled_for_decode_kernels (#2299)	2024-12-01 19:24:12 +08:00
Yineng Zhang	fc78640e00	minor: support flashinfer nightly (#2295)	2024-12-01 18:55:26 +08:00
Yineng Zhang	118b6af35e	feat: add should_use_tensor_core (#2179)	2024-12-01 18:01:16 +08:00
Lianmin Zheng	9449a95431	[CI] Balance CI tests (#2293)	2024-12-01 01:47:30 -08:00
Lianmin Zheng	0d6a49bd7d	[CI] Kill zombie processes (#2280)	2024-11-30 00:24:30 -08:00
Yineng Zhang	fae4e5e99a	chore: bump v0.3.6.post3 (#2259)	2024-11-30 01:41:16 +08:00
Ying Sheng	e1e595d702	[feat] Refactor session control interface and add CI (#2173)	2024-11-25 12:32:51 -08:00
Lianmin Zheng	254fd130e2	[CI] Split test cases in CI for better load balancing (#2180)	2024-11-25 04:58:16 -08:00
Xuehai Pan	62a4a339eb	docs: fix module docstrings and copyright headers (#2077)	2024-11-22 22:16:53 +08:00
Byron Hsu	30af7dfb34	[router] add base_gpu_id server args & merged radix tree python reference (#2115)	2024-11-21 17:13:33 -08:00
Lianmin Zheng	722530fa01	Enable overlap scheduler by default for the triton attention backend (#2105)	2024-11-20 02:58:35 -08:00
Lianmin Zheng	56a347f7d3	Move test_session_id.py to playground (#2104)	2024-11-20 01:28:27 -08:00
Ke Bao	62832bb272	Support cuda graph for DP attention (#2061)	2024-11-17 16:29:20 -08:00
Lianmin Zheng	9c939a3d8b	Clean up metrics code (#1972)	2024-11-09 15:43:20 -08:00
Lianmin Zheng	7ef0084b0d	Add sentence_transformers to CI dependency (#1958)	2024-11-08 01:21:29 -08:00
Chayenne	c77c1e05ba	fix black in pre-commit (#1940)	2024-11-08 07:42:47 +08:00
Xuehai Pan	a5e0defb5a	minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926)	2024-11-06 13:46:04 +00:00
Byron Hsu	530ff541cf	[router] Impl radix tree and set up CI (#1893) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2024-11-04 10:56:52 -08:00
Jani Monoses	916b3cdddc	Allow passing dtype and max_new_tokens to HF reference script (#1903)	2024-11-03 08:24:37 -08:00
Lianmin Zheng	a2e0424abf	Fix memory leak for chunked prefill 2 (#1858) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2024-10-31 14:51:51 -07:00
Lianmin Zheng	b548801ddb	Update docs (#1839)	2024-10-30 02:49:08 -07:00
Lianmin Zheng	6aa94b967c	Update ci workflows (#1804)	2024-10-26 04:32:36 -07:00
Ying Sheng	c5325aba75	[Profile] Add pytorch profiler (#1604)	2024-10-07 14:37:16 -07:00
Liangsheng Yin	99ec439da4	Organize Attention Backends (#1547)	2024-09-30 15:54:18 -07:00
Lianmin Zheng	fb2d0680e0	[Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510)	2024-09-24 21:37:33 -07:00
Lianmin Zheng	2854a5ea9f	Fix the overhead due to penalizer in bench_latency (#1496)	2024-09-23 07:38:14 -07:00
Lianmin Zheng	167591e864	Better unit tests for adding a new model (#1488)	2024-09-22 01:50:37 -07:00
Lianmin Zheng	2cd7e181dd	Fix env vars in bench_latency (#1472)	2024-09-19 03:19:26 -07:00
Ying Sheng	37963394aa	[Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433)	2024-09-15 12:46:04 -07:00
Ying Sheng	712216928f	[Feature] Initial support for multi-LoRA serving (#1307)	2024-09-12 16:46:14 -07:00
Lianmin Zheng	3a6e8b6d78	[Minor] move triton attention kernels into a separate folder (#1379)	2024-09-10 15:15:08 -07:00

62 Commits