67 Commits

Author SHA1 Message Date
Shi Shuai
7443197a63 [CI] Improve Docs CI Efficiency (#3587)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-02-14 19:57:00 -08:00
Ke Bao
862dd76c76 Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 (#3582) 2025-02-15 05:28:34 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Yineng Zhang
4d2dbeaca7 remove cutex dependency (#3422) 2025-02-09 18:33:20 +08:00
Yineng Zhang
d39899e85c upgrade flashinfer v0.2.0.post2 (#3288)
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-04 21:41:40 +08:00
Ke Bao
c23d5706f4 Update whl index path (#3128) 2025-01-25 23:57:09 +08:00
Ke Bao
665e5e85f6 Add step to update sgl-kernel whl index (#3110) 2025-01-25 02:03:01 +08:00
Byron Hsu
9a0cc2e90e [router] Forward all request headers from router to workers (#3070) 2025-01-23 20:30:31 -08:00
Lianmin Zheng
61f42b5732 Move sgl.Runtime under sglang/lang (#2990) 2025-01-19 17:10:29 -08:00
Byron Hsu
ef18b0eda2 [router] Allow empty worker list for sglang.launch_router (#2979) 2025-01-19 01:05:23 -08:00
Yineng Zhang
d06c1ab587 update ci install dependency (#2949) 2025-01-17 23:42:23 +08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
fzyzcjy
923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) 2025-01-13 11:38:51 -08:00
Lianmin Zheng
8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-07 23:29:10 -08:00
Yineng Zhang
bc6ad367c2 fix lint (#2733) 2025-01-05 14:45:42 +08:00
Ce Gao
f5d0865b25 feat: Support VLM in reference_hf (#2726)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:32:30 +08:00
Yineng Zhang
d49b13c6f8 feat: use CUDA 12.4 by default (for FA3) (#2682) 2024-12-31 15:52:09 +08:00
fzyzcjy
f707470019 CI: Update scripts to fail fast (#2672) 2024-12-30 19:04:01 -08:00
Yineng Zhang
d95a5f5bf5 fix followup #2517 (#2524) 2024-12-19 23:24:30 +08:00
Ata Fatahi
ce094a5d79 Clean up GPU memory after killing sglang processes (#2457)
Signed-off-by: Ata Fatahi <immrata@gmail.com>
2024-12-17 03:42:40 -08:00
Yineng Zhang
7154b4b1df minor: update flashinfer nightly (#2490) 2024-12-16 23:02:49 +08:00
Lianmin Zheng
835f8afc77 Migrate llama_classification to use the /classify interface (#2417) 2024-12-08 23:30:51 -08:00
Lianmin Zheng
96db0f666d Update killall_sglang.sh (#2397) 2024-12-08 01:56:26 -08:00
Yineng Zhang
75ae968959 minor: update killall script (#2391) 2024-12-08 04:21:00 +08:00
Yineng Zhang
3dbd73d319 minor: rm unused _grouped_size_compiled_for_decode_kernels (#2299) 2024-12-01 19:24:12 +08:00
Yineng Zhang
fc78640e00 minor: support flashinfer nightly (#2295) 2024-12-01 18:55:26 +08:00
Yineng Zhang
118b6af35e feat: add should_use_tensor_core (#2179) 2024-12-01 18:01:16 +08:00
Lianmin Zheng
9449a95431 [CI] Balance CI tests (#2293) 2024-12-01 01:47:30 -08:00
Lianmin Zheng
0d6a49bd7d [CI] Kill zombie processes (#2280) 2024-11-30 00:24:30 -08:00
Yineng Zhang
fae4e5e99a chore: bump v0.3.6.post3 (#2259) 2024-11-30 01:41:16 +08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Lianmin Zheng
254fd130e2 [CI] Split test cases in CI for better load balancing (#2180) 2024-11-25 04:58:16 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Byron Hsu
30af7dfb34 [router] add base_gpu_id server args & merged radix tree python reference (#2115) 2024-11-21 17:13:33 -08:00
Lianmin Zheng
722530fa01 Enable overlap scheduler by default for the triton attention backend (#2105) 2024-11-20 02:58:35 -08:00
Lianmin Zheng
56a347f7d3 Move test_session_id.py to playground (#2104) 2024-11-20 01:28:27 -08:00
Ke Bao
62832bb272 Support cuda graph for DP attention (#2061) 2024-11-17 16:29:20 -08:00
Lianmin Zheng
9c939a3d8b Clean up metrics code (#1972) 2024-11-09 15:43:20 -08:00
Lianmin Zheng
7ef0084b0d Add sentence_transformers to CI dependency (#1958) 2024-11-08 01:21:29 -08:00
Chayenne
c77c1e05ba fix black in pre-commit (#1940) 2024-11-08 07:42:47 +08:00
Xuehai Pan
a5e0defb5a minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926) 2024-11-06 13:46:04 +00:00
Byron Hsu
530ff541cf [router] Impl radix tree and set up CI (#1893)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-11-04 10:56:52 -08:00
Jani Monoses
916b3cdddc Allow passing dtype and max_new_tokens to HF reference script (#1903) 2024-11-03 08:24:37 -08:00
Lianmin Zheng
a2e0424abf Fix memory leak for chunked prefill 2 (#1858)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2024-10-31 14:51:51 -07:00
Lianmin Zheng
b548801ddb Update docs (#1839) 2024-10-30 02:49:08 -07:00
Lianmin Zheng
6aa94b967c Update ci workflows (#1804) 2024-10-26 04:32:36 -07:00
Ying Sheng
c5325aba75 [Profile] Add pytorch profiler (#1604) 2024-10-07 14:37:16 -07:00
Liangsheng Yin
99ec439da4 Organize Attention Backends (#1547) 2024-09-30 15:54:18 -07:00
Lianmin Zheng
fb2d0680e0 [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) 2024-09-24 21:37:33 -07:00
Lianmin Zheng
2854a5ea9f Fix the overhead due to penalizer in bench_latency (#1496) 2024-09-23 07:38:14 -07:00