283 Commits

Author SHA1 Message Date
ylying
fe3be1595d Add qwen2 tie word embedding (#630) 2024-07-16 11:48:49 -07:00
Ying Sheng
0aa189f150 Disable NCCL_NVLS by default (#631) 2024-07-16 09:05:10 -07:00
Liangsheng Yin
c9ee3d3559 Fix model forward grad (#628) 2024-07-15 22:09:09 -07:00
Lianmin Zheng
41d1f67704 Fix flush cache (#627) 2024-07-15 20:44:04 -07:00
Ying Sheng
56f5fc4ab5 Bump version to 0.1.21 (#626) 2024-07-15 13:10:53 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625)
Co-authored-by: Mingyi <wisclmy0611@gmail.com>
2024-07-15 07:10:51 -07:00
Mingyi
5ac8b80677 Simplify mem state (#623) 2024-07-15 02:01:09 -07:00
Liangsheng Yin
a56858ba67 Unify index operations (#620) 2024-07-14 12:55:55 -07:00
Liangsheng Yin
564a898ad9 Optimize mem indices management (#619) 2024-07-13 23:39:37 -07:00
Lianmin Zheng
5d264a90ac Bump version to 0.1.20 (#618) 2024-07-13 17:27:55 -07:00
Ying Sheng
5949b1ca0e Fix memory pool index error (#616) 2024-07-13 16:45:11 -07:00
Lianmin Zheng
0feca02dd9 Improve benchmark scripts (#615) 2024-07-13 15:59:04 -07:00
Liangsheng Yin
10143e1a5f Memorypool chunked prefetch (#614) 2024-07-13 15:24:03 -07:00
Lianmin Zheng
65c6577696 Improve benchmark scripts & fix llava (#613) 2024-07-13 15:00:26 -07:00
Lianmin Zheng
665815969a Enable cuda graph by default (#612) 2024-07-13 05:29:46 -07:00
Lianmin Zheng
396a69240f Cleanup attention backend: flashinfer and triton (#611) 2024-07-12 18:21:11 -07:00
Lianmin Zheng
af4e7910e7 Clean up the usage of flashinfer (#610) 2024-07-12 13:00:03 -07:00
Lianmin Zheng
519e20cfda Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py (#609) 2024-07-12 12:28:09 -07:00
Lianmin Zheng
d9a6902986 Fix bench latency (#607) 2024-07-11 14:37:01 -07:00
Lianmin Zheng
ad872feb14 bump version to 0.1.19 2024-07-09 02:23:14 -07:00
Lianmin Zheng
da2e5d6546 Fix the default argument of OpenAI Chat completion (#605) 2024-07-09 02:04:43 -07:00
胡译文
02b7258658 [Feat] Expose logprob options to sgl.gen API (#503)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-07-09 00:35:39 -07:00
prophe
d557e9f3b7 Update chat template for qwen and yi-1.5. (#530) 2024-07-08 23:55:44 -07:00
Tommy Yang
740c46a152 Add Qwen2 MoE support (#603) 2024-07-08 23:44:59 -07:00
Tommy Yang
b38687226a Make sglang compat with vllm 0.5.1 (#598) 2024-07-08 23:44:22 -07:00
Pan Lyu
710f614ebe add minicpm support (#602) 2024-07-08 23:27:04 -07:00
Liangsheng Yin
f25b76c02a add LogitsMetadata (#604) 2024-07-08 17:46:55 -07:00
Mingyi
f4e885b7c3 Reduce number of workspaces (#601) 2024-07-07 19:35:22 -07:00
Liangsheng Yin
0877f1e75b Fix streaming (#600) 2024-07-07 01:55:58 -07:00
Liangsheng Yin
5304b4ef58 Add --enable-p2p-check option (#599) 2024-07-06 23:34:10 -07:00
Pan Lyu
26908d9568 * fix(detokenizer_manager.py): fix truncated decoded output (#586)
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
2024-07-06 14:53:22 -07:00
Mingyi
c0982ac553 Fix Llava model (#594) 2024-07-06 00:58:46 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Ying Sheng
5a57b8addd Add Gemma2 (#592) 2024-07-05 09:48:54 -07:00
Ying Sheng
2f11936f95 bump version to 0.1.18 2024-07-04 06:27:29 +00:00
Lianmin Zheng
63fbef9876 fix flashinfer & http log level 2024-07-03 23:19:33 -07:00
Ying Sheng
2a754e57b0 2x performance improvement for large prefill & Fix workspace conflicts (#579) 2024-07-03 16:14:57 -07:00
Liangsheng Yin
96c503eb60 fix the broken server args (#585) 2024-07-03 16:01:19 -07:00
Chen Xuechen Li
441cca773d support gptj style rope in llama 2024-07-03 22:06:58 +00:00
Lianmin Zheng
c7709d3abe Update install commands (#583) 2024-07-03 02:10:59 -07:00
Ying Sheng
9380f50ff9 Turn on flashinfer by default (#578) 2024-07-02 02:25:07 -07:00
Daniel Hernandez Garcia
95dc093b19 [BugFix] gemma loading weights "lm_head.weight" key error (#577) 2024-07-01 22:10:07 -07:00
Yueyang Pan
d9ac639202 Fix flashinfer version (#576) 2024-07-01 22:08:39 -07:00
Ying Sheng
75b31a2a88 Update run_batch interface and max_prefill_tokens (#574) 2024-06-30 18:26:04 -07:00
sglang
11616fc6bd Minor fix in compiler & format (#545) 2024-06-29 23:42:14 -07:00
Ying Sheng
9ce89bc14b Update benchmark script (#571) 2024-06-28 00:44:22 -07:00
Lianmin Zheng
badf3fa020 Expose dtype argument (#569) 2024-06-27 23:30:39 -07:00
Lianmin Zheng
2e6e62e156 Increase the number of thread limitation for tp worker managers. (#567) 2024-06-26 09:33:45 -07:00
Lianmin Zheng
a385ee27bd Warmup cublas (#566) 2024-06-25 12:46:00 -07:00
Lianmin Zheng
eb1ae6ae0c Add sglang.bench_latency for offline benchmark (#564) 2024-06-25 03:38:04 -07:00