| Author | Commit | Message | Date |
|---|---|---|---|
| Yudi Xue | 19f33b3237 | add sglang version to get_server_info (#2206) | 2024-11-26 12:10:23 -08:00 |
| Lianmin Zheng | ea34350d88 | Rename double sparsity config file (#2188) | 2024-11-25 17:12:08 -08:00 |
| Lianmin Zheng | 1605ae121e | [CI] Minor fix for CI (#2187) | 2024-11-25 16:38:43 -08:00 |
| Rin Intachuen | 1aea19f64b | Input_embeds support (#2052) | 2024-11-25 16:35:04 -08:00 |
| Yixin Dong | 7f076c2ce6 | Update XGrammar to the latest API (#2176); Co-authored-by: Ben Gitter <gitterbd@gmail.com> | 2024-11-25 15:58:30 -08:00 |
| Lianmin Zheng | 3c5538f781 | Update CI threshold (#2186) | 2024-11-25 15:24:17 -08:00 |
| Ying Sheng | e1e595d702 | [feat] Refactor session control interface and add CI (#2173) | 2024-11-25 12:32:51 -08:00 |
| Lianmin Zheng | 254fd130e2 | [CI] Split test cases in CI for better load balancing (#2180) | 2024-11-25 04:58:16 -08:00 |
| Lianmin Zheng | 5652c56535 | Update CI threshold & Improve code style (#2159) | 2024-11-24 06:29:38 -08:00 |
| Henry Hyeonmok Ko | dbe1729395 | Merged three native APIs into one: get_server_info (#2152) | 2024-11-24 01:37:58 -08:00 |
| Lianmin Zheng | a78d8f8db3 | [CI] Fix test cases (#2137) | 2024-11-23 01:00:07 -08:00 |
| Jani Monoses | d98fa1e93d | Add simple CPU offloading support. (#2081) | 2024-11-23 06:23:53 +00:00 |
| Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
| Yineng Zhang | 4f8c3aeafc | minor: update gsm8k threshold (#2125) | 2024-11-22 19:23:58 +08:00 |
| Lianmin Zheng | dfec7fca06 | Rename sglang.bench_latency to sglang.bench_one_batch (#2118) | 2024-11-21 20:07:48 -08:00 |
| Jake Poznanski | 8048c28c11 | Fix #2037 - Context length check does not take into out pad tokens for visual models (#2106) | 2024-11-21 19:05:41 -08:00 |
| James Xu | f6f713797b | Add support for Qwen2-VL-based embedding models (#2055) | 2024-11-21 14:24:25 -08:00 |
| Lianmin Zheng | 56a347f7d3 | Move test_session_id.py to playground (#2104) | 2024-11-20 01:28:27 -08:00 |
| Ying Sheng | 5942dfc00a | [feat] Add session control (#2073) | 2024-11-20 00:36:53 -08:00 |
| Lianmin Zheng | 7d671e4ad2 | Enable overlap by default (#2067) | 2024-11-19 22:07:58 -08:00 |
| Yineng Zhang | f239268fad | minor: update gsm8k eval (#2091) | 2024-11-19 20:36:55 +08:00 |
| Lianmin Zheng | b7a065eae3 | Use cuda event wait and synchronization instead of busy waiting (#2089) | 2024-11-19 00:21:46 -08:00 |
| Lianmin Zheng | b110453802 | Simplify logits penalizer (#2086) | 2024-11-18 17:48:28 -08:00 |
| Lianmin Zheng | 80e2c4a8de | Fix chunked prefill with output logprob (#2083) | 2024-11-18 13:16:28 -08:00 |
| Yineng Zhang | 766192610e | feat: update torch 2.5.1 (#2069) | 2024-11-18 21:29:13 +08:00 |
| Lianmin Zheng | 4af3f889fc | Simplify flashinfer indices update for prefill (#2074); Co-authored-by: kavioyu <kavioyu@tencent.com>; Co-authored-by: kavioyu <kavioyu@gmail.com> | 2024-11-18 00:02:36 -08:00 |
| Lianmin Zheng | a7164b620f | Tune the threshold for accuracy tests in CI (#2071) | 2024-11-17 21:51:00 -08:00 |
| Lianmin Zheng | 116685337e | Fix cuda illegal memory access in overlap mode (#2070) | 2024-11-17 21:29:30 -08:00 |
| Lianmin Zheng | a9e90b4bce | [Minor] Fix styles for overlap mode (#2068) | 2024-11-17 19:49:20 -08:00 |
| Tanjiro | 8c280cee55 | add phi-3 small support (#2062); Co-authored-by: Tushar Goel <114812108+AI-Tushar@users.noreply.github.com> | 2024-11-17 18:47:43 -08:00 |
| Lianmin Zheng | 11f881d173 | Deprecate --disable-flashinfer and --disable-flashinfer-sampling (#2065) | 2024-11-17 16:20:58 -08:00 |
| Lianmin Zheng | 38625e2139 | Remove monkey_patch_vllm_dummy_weight_loader (#2064) | 2024-11-17 15:48:12 -08:00 |
| Lianmin Zheng | c1f401fc58 | Revert "chore: update torch v2.5.1" (#2063) | 2024-11-17 15:29:38 -08:00 |
| Yineng Zhang | 3b878863f7 | chore: update torch v2.5.1 (#1849) | 2024-11-18 00:06:00 +08:00 |
| Lianmin Zheng | f719d9aebc | Launch dp ranks in parallel (#2053); Co-authored-by: Haotian Liu <6631389+haotian-liu@users.noreply.github.com> | 2024-11-16 17:39:39 -08:00 |
| Ke Bao | 976bc302e5 | Support DP MLA (#1970) | 2024-11-16 09:01:43 +00:00 |
| Ke Wen | cf2489762b | Add Tensor Parallel to torch_native_llama (#1876) | 2024-11-15 21:26:00 -08:00 |
| Lianmin Zheng | 2558d6a675 | Fix the default arguments of bench_offline_throughput.py & simplify detokenizer manager (#2042) | 2024-11-15 05:02:44 -08:00 |
| zolinthecow | f6dd648620 | Offline LLM Engine Benchmark Throughput (#1968); Co-authored-by: ByronHsu <byronhsu1230@gmail.com> | 2024-11-14 21:59:33 -08:00 |
| Lianmin Zheng | c3eac1b010 | Fix torch.compile for MoE (#2033) | 2024-11-14 01:30:24 -08:00 |
| Lianmin Zheng | f407fcf9ef | Release v0.3.5.post1 (#2022) | 2024-11-13 10:27:12 -08:00 |
| Lianmin Zheng | ba069a24d3 | Fix grammar backend (#2018) | 2024-11-12 21:17:38 -08:00 |
| DarkSharpness | 125b1199c5 | support parallel grammar preprocessing (#1996); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-11-12 08:45:28 -08:00 |
| Xiaoyu Zhang | eff468dd5a | fix test_embedding_models prompt length too long's bug (#2015) | 2024-11-12 23:21:16 +08:00 |
| Xiaoyu Zhang | 027e65248f | support echo=true and logprobs in openai api when logprobs=1 in lm-evaluation-harness (#1998) | 2024-11-11 23:21:20 -08:00 |
| James Xu | ddeb9d42de | Add engine encode (#1995); Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> | 2024-11-11 11:48:17 -08:00 |
| Lianmin Zheng | 1929c06762 | Simplify prometheus metrics (#1981); Co-authored-by: Mohit Reddy <mohitreddy1996@users.noreply.github.com> | 2024-11-10 04:39:32 -08:00 |
| Lianmin Zheng | 520f0094e4 | [CI] balance unit tests (#1977) | 2024-11-09 16:46:14 -08:00 |
| Lianmin Zheng | 9c939a3d8b | Clean up metrics code (#1972) | 2024-11-09 15:43:20 -08:00 |
| Lianmin Zheng | 549e8b8366 | [Minor] Fix a typo in test_torchao.py (#1976) | 2024-11-09 15:07:27 -08:00 |