Commit Graph

156 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Yineng Zhang | 85e1a6f3aa | Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2024-12-02 23:22:13 +08:00 |
| Qun Yang | 62c516ac45 | Add a simple torch native attention backend (#2241) | 2024-12-01 03:01:25 -08:00 |
| Lianmin Zheng | 4936be8acc | Revert "Revert "[FEAT] Support GGUF format"" (#2287) | 2024-11-30 22:14:48 -08:00 |
| Lianmin Zheng | 7e4c6dd8da | Revert "[FEAT] Support GGUF format" (#2285) | 2024-11-30 19:03:26 -08:00 |
| Yang Zheng | 883c955489 | [FEAT] Support GGUF format (#2215) (Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>) | 2024-11-30 00:44:48 -08:00 |
| Lianmin Zheng | 94e167ea5a | Fix the default chunked prefill size (#2268) | 2024-11-29 16:03:32 -08:00 |
| Xiaoyu Zhang | 262e370f78 | [benchmark] Add fused_moe_triton benchmark and tuning tools (#2225) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>, HAI <hixiao@gmail.com>) | 2024-11-29 13:36:45 -08:00 |
| Ying Sheng | 8b48496aaf | Revert "Revert "Add simple CPU offloading support"" (#2253) (Co-authored-by: Jani Monoses <jani.monoses@gmail.com>, youkaichao <youkaichao@gmail.com>) | 2024-11-28 23:58:54 -08:00 |
| Ying Sheng | 4057ea82c9 | Revert "Add simple CPU offloading support" (#2252) (We'll re-add the commit to correctly ack Kaichao's authorship) | 2024-11-28 23:36:55 -08:00 |
| Lianmin Zheng | 09798b36cd | Fix chunked prefill size for bench_offline_throughput (#2234) | 2024-11-27 23:37:20 -08:00 |
| Lianmin Zheng | 731146f6cb | Fix mixed chunked prefill in overlap mode (#2158) | 2024-11-24 07:17:37 -08:00 |
| Jani Monoses | d98fa1e93d | Add simple CPU offloading support. (#2081) | 2024-11-23 06:23:53 +00:00 |
| Ankur Neog | 865233e256 | Add initial support for intel Gaudi accelerators (#2121) | 2024-11-22 20:22:23 -08:00 |
| Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
| Byron Hsu | 30af7dfb34 | [router] add base_gpu_id server args & merged radix tree python reference (#2115) | 2024-11-21 17:13:33 -08:00 |
| Jerry Zhang | 5c6a41facf | Error out when torchao-config option is not recognized (#2107) | 2024-11-20 17:37:28 -08:00 |
| Lianmin Zheng | 722530fa01 | Enable overlap scheduler by default for the triton attention backend (#2105) | 2024-11-20 02:58:35 -08:00 |
| Lianmin Zheng | 7d671e4ad2 | Enable overlap by default (#2067) | 2024-11-19 22:07:58 -08:00 |
| Ke Bao | 699384cb01 | Set schedule policy more conservative for DP attention (#2096) | 2024-11-19 20:57:18 -08:00 |
| Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
| Lianmin Zheng | b110453802 | Simplify logits penalizer (#2086) | 2024-11-18 17:48:28 -08:00 |
| Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
| Ke Bao | 62832bb272 | Support cuda graph for DP attention (#2061) | 2024-11-17 16:29:20 -08:00 |
| Lianmin Zheng | 11f881d173 | Deprecate --disable-flashinfer and --disable-flashinfer-sampling (#2065) | 2024-11-17 16:20:58 -08:00 |
| Lianmin Zheng | f719d9aebc | Launch dp ranks in parallel (#2053) (Co-authored-by: Haotian Liu <6631389+haotian-liu@users.noreply.github.com>) | 2024-11-16 17:39:39 -08:00 |
| Ke Bao | 976bc302e5 | Support DP MLA (#1970) | 2024-11-16 09:01:43 +00:00 |
| HAI | 2ffe0a7363 | Add get_amdgpu_memory_capacity() (#2049) | 2024-11-15 22:51:48 -08:00 |
| Lianmin Zheng | b01df48cf2 | [Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity (#2044) | 2024-11-15 06:21:57 -08:00 |
| Patrick Yi | 13ce3e4b5d | Add download_dir ServerArgs property (#2027) | 2024-11-13 23:26:56 -08:00 |
| Lianmin Zheng | ba069a24d3 | Fix grammar backend (#2018) | 2024-11-12 21:17:38 -08:00 |
| Lianmin Zheng | a509552087 | [minor] Improve code style and compatibility (#1961) | 2024-11-08 02:19:41 -08:00 |
| Chayenne | c77c1e05ba | fix black in pre-commit (#1940) | 2024-11-08 07:42:47 +08:00 |
| Lzhang-hub | a146d9990e | support prometheus metrics (#1853) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>, Byron Hsu <byronhsu1230@gmail.com>) | 2024-11-05 20:42:53 -08:00 |
| Byron Hsu | a7a0a6886b | Make decode log interval configurable (#1847) | 2024-10-30 19:59:20 -07:00 |
| Lianmin Zheng | 86fc0d79d0 | Add a watch dog thread (#1816) | 2024-10-27 02:00:50 -07:00 |
| Lianmin Zheng | 86e0dde555 | Improve the user control of new_token_ratio (#1811) | 2024-10-26 16:39:41 -07:00 |
| Lianmin Zheng | 2b80978859 | Provide an argument to set the maximum batch size for cuda graph (#1809) | 2024-10-26 15:09:33 -07:00 |
| Liangsheng Yin | 07bf2e846a | Allow consecutive ports when launching multiple sglang servers. (#1802) | 2024-10-26 06:43:24 +00:00 |
| DarkSharpness | b77a02cdfd | [Performance] Support both xgrammar and outlines for constrained decoding (#1752) | 2024-10-25 21:47:02 +00:00 |
| Lianmin Zheng | b121bc03a3 | Simplify batch result resolution (#1735) | 2024-10-20 19:47:14 -07:00 |
| Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
| havetc | ecb8bad276 | Returning a per request metric for number of cached_tokens read (#1599) | 2024-10-16 11:49:22 -07:00 |
| Lianmin Zheng | 9116b2896f | Add a new event loop (#1677) | 2024-10-16 01:33:20 -07:00 |
| Shuo Yang | 061e546313 | Support double sparsity (#1459) | 2024-10-14 02:00:41 -07:00 |
| Lianmin Zheng | 7ee6c259ff | Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652) | 2024-10-12 21:35:30 -07:00 |
| Lianmin Zheng | 9da5a60b18 | Add an option to disable penalizer (#1651) | 2024-10-12 17:53:23 -07:00 |
| Zhang, Liangang | 5d638c92f5 | [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) | 2024-10-12 18:10:32 +00:00 |
| Lianmin Zheng | 23cc66f7b6 | Add back data parallelism (#1635) | 2024-10-11 07:22:48 -07:00 |
| glen-amd | 58093b868f | Nit about the decorator of PortArgs.init_new (#1611) | 2024-10-11 02:17:47 -07:00 |
| Zhang, Liangang | 8275049ce3 | Add device support (#1607) | 2024-10-11 02:05:58 -07:00 |