Ke Bao
|
c23d5706f4
|
Update whl index path (#3128)
|
2025-01-25 23:57:09 +08:00 |
|
Ke Bao
|
665e5e85f6
|
Add step to update sgl-kernel whl index (#3110)
|
2025-01-25 02:03:01 +08:00 |
|
Byron Hsu
|
9a0cc2e90e
|
[router] Forward all request headers from router to workers (#3070)
|
2025-01-23 20:30:31 -08:00 |
|
Lianmin Zheng
|
61f42b5732
|
Move sgl.Runtime under sglang/lang (#2990)
|
2025-01-19 17:10:29 -08:00 |
|
Byron Hsu
|
ef18b0eda2
|
[router] Allow empty worker list for sglang.launch_router (#2979)
|
2025-01-19 01:05:23 -08:00 |
|
Yineng Zhang
|
d06c1ab587
|
update ci install dependency (#2949)
|
2025-01-17 23:42:23 +08:00 |
|
Lianmin Zheng
|
f65c13b559
|
Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902)
|
2025-01-15 04:54:14 -08:00 |
|
fzyzcjy
|
923f518337
|
CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)
|
2025-01-13 11:38:51 -08:00 |
|
Lianmin Zheng
|
8a6906127a
|
Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-07 23:29:10 -08:00 |
|
Yineng Zhang
|
bc6ad367c2
|
fix lint (#2733)
|
2025-01-05 14:45:42 +08:00 |
|
Ce Gao
|
f5d0865b25
|
feat: Support VLM in reference_hf (#2726)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
|
2025-01-03 22:32:30 +08:00 |
|
Yineng Zhang
|
d49b13c6f8
|
feat: use CUDA 12.4 by default (for FA3) (#2682)
|
2024-12-31 15:52:09 +08:00 |
|
fzyzcjy
|
f707470019
|
CI: Update scripts to fail fast (#2672)
|
2024-12-30 19:04:01 -08:00 |
|
Yineng Zhang
|
d95a5f5bf5
|
fix followup #2517 (#2524)
|
2024-12-19 23:24:30 +08:00 |
|
Ata Fatahi
|
ce094a5d79
|
Clean up GPU memory after killing sglang processes (#2457)
Signed-off-by: Ata Fatahi <immrata@gmail.com>
|
2024-12-17 03:42:40 -08:00 |
|
Yineng Zhang
|
7154b4b1df
|
minor: update flashinfer nightly (#2490)
|
2024-12-16 23:02:49 +08:00 |
|
Lianmin Zheng
|
835f8afc77
|
Migrate llama_classification to use the /classify interface (#2417)
|
2024-12-08 23:30:51 -08:00 |
|
Lianmin Zheng
|
96db0f666d
|
Update killall_sglang.sh (#2397)
|
2024-12-08 01:56:26 -08:00 |
|
Yineng Zhang
|
75ae968959
|
minor: update killall script (#2391)
|
2024-12-08 04:21:00 +08:00 |
|
Yineng Zhang
|
3dbd73d319
|
minor: rm unused _grouped_size_compiled_for_decode_kernels (#2299)
|
2024-12-01 19:24:12 +08:00 |
|
Yineng Zhang
|
fc78640e00
|
minor: support flashinfer nightly (#2295)
|
2024-12-01 18:55:26 +08:00 |
|
Yineng Zhang
|
118b6af35e
|
feat: add should_use_tensor_core (#2179)
|
2024-12-01 18:01:16 +08:00 |
|
Lianmin Zheng
|
9449a95431
|
[CI] Balance CI tests (#2293)
|
2024-12-01 01:47:30 -08:00 |
|
Lianmin Zheng
|
0d6a49bd7d
|
[CI] Kill zombie processes (#2280)
|
2024-11-30 00:24:30 -08:00 |
|
Yineng Zhang
|
fae4e5e99a
|
chore: bump v0.3.6.post3 (#2259)
|
2024-11-30 01:41:16 +08:00 |
|
Ying Sheng
|
e1e595d702
|
[feat] Refactor session control interface and add CI (#2173)
|
2024-11-25 12:32:51 -08:00 |
|
Lianmin Zheng
|
254fd130e2
|
[CI] Split test cases in CI for better load balancing (#2180)
|
2024-11-25 04:58:16 -08:00 |
|
Xuehai Pan
|
62a4a339eb
|
docs: fix module docstrings and copyright headers (#2077)
|
2024-11-22 22:16:53 +08:00 |
|
Byron Hsu
|
30af7dfb34
|
[router] add base_gpu_id server args & merged radix tree python reference (#2115)
|
2024-11-21 17:13:33 -08:00 |
|
Lianmin Zheng
|
722530fa01
|
Enable overlap scheduler by default for the triton attention backend (#2105)
|
2024-11-20 02:58:35 -08:00 |
|
Lianmin Zheng
|
56a347f7d3
|
Move test_session_id.py to playground (#2104)
|
2024-11-20 01:28:27 -08:00 |
|
Ke Bao
|
62832bb272
|
Support cuda graph for DP attention (#2061)
|
2024-11-17 16:29:20 -08:00 |
|
Lianmin Zheng
|
9c939a3d8b
|
Clean up metrics code (#1972)
|
2024-11-09 15:43:20 -08:00 |
|
Lianmin Zheng
|
7ef0084b0d
|
Add sentence_transformers to CI dependency (#1958)
|
2024-11-08 01:21:29 -08:00 |
|
Chayenne
|
c77c1e05ba
|
fix black in pre-commit (#1940)
|
2024-11-08 07:42:47 +08:00 |
|
Xuehai Pan
|
a5e0defb5a
|
minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926)
|
2024-11-06 13:46:04 +00:00 |
|
Byron Hsu
|
530ff541cf
|
[router] Impl radix tree and set up CI (#1893)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2024-11-04 10:56:52 -08:00 |
|
Jani Monoses
|
916b3cdddc
|
Allow passing dtype and max_new_tokens to HF reference script (#1903)
|
2024-11-03 08:24:37 -08:00 |
|
Lianmin Zheng
|
a2e0424abf
|
Fix memory leak for chunked prefill 2 (#1858)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
|
2024-10-31 14:51:51 -07:00 |
|
Lianmin Zheng
|
b548801ddb
|
Update docs (#1839)
|
2024-10-30 02:49:08 -07:00 |
|
Lianmin Zheng
|
6aa94b967c
|
Update ci workflows (#1804)
|
2024-10-26 04:32:36 -07:00 |
|
Ying Sheng
|
c5325aba75
|
[Profile] Add pytorch profiler (#1604)
|
2024-10-07 14:37:16 -07:00 |
|
Liangsheng Yin
|
99ec439da4
|
Organize Attention Backends (#1547)
|
2024-09-30 15:54:18 -07:00 |
|
Lianmin Zheng
|
fb2d0680e0
|
[Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510)
|
2024-09-24 21:37:33 -07:00 |
|
Lianmin Zheng
|
2854a5ea9f
|
Fix the overhead due to penalizer in bench_latency (#1496)
|
2024-09-23 07:38:14 -07:00 |
|
Lianmin Zheng
|
167591e864
|
Better unit tests for adding a new model (#1488)
|
2024-09-22 01:50:37 -07:00 |
|
Lianmin Zheng
|
2cd7e181dd
|
Fix env vars in bench_latency (#1472)
|
2024-09-19 03:19:26 -07:00 |
|
Ying Sheng
|
37963394aa
|
[Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433)
|
2024-09-15 12:46:04 -07:00 |
|
Ying Sheng
|
712216928f
|
[Feature] Initial support for multi-LoRA serving (#1307)
|
2024-09-12 16:46:14 -07:00 |
|
Lianmin Zheng
|
3a6e8b6d78
|
[Minor] move triton attention kernels into a separate folder (#1379)
|
2024-09-10 15:15:08 -07:00 |
|