Commit Graph

253 Commits

Author SHA1 Message Date
Pan Lyu
26908d9568 * fix(detokenizer_manager.py): fix truncated decoded output (#586) 2024-07-06 14:53:22 -07:00
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
Mingyi
c0982ac553 Fix Llava model (#594) 2024-07-06 00:58:46 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Ying Sheng
5a57b8addd Add Gemma2 (#592) 2024-07-05 09:48:54 -07:00
Ying Sheng
2f11936f95 bump version to 0.1.18 2024-07-04 06:27:29 +00:00
Lianmin Zheng
63fbef9876 fix flashinfer & http log level 2024-07-03 23:19:33 -07:00
Ying Sheng
2a754e57b0 2x performance improvement for large prefill & Fix workspace conflicts (#579) 2024-07-03 16:14:57 -07:00
Liangsheng Yin
96c503eb60 fix the broken server args (#585) 2024-07-03 16:01:19 -07:00
Chen Xuechen Li
441cca773d support gptj style rope in llama 2024-07-03 22:06:58 +00:00
Lianmin Zheng
c7709d3abe Update install commands (#583) 2024-07-03 02:10:59 -07:00
Ying Sheng
9380f50ff9 Turn on flashinfer by default (#578) 2024-07-02 02:25:07 -07:00
Daniel Hernandez Garcia
95dc093b19 [BugFix] gemma loading weights "lm_head.weight" key error (#577) 2024-07-01 22:10:07 -07:00
Yueyang Pan
d9ac639202 Fix flashinfer version (#576) 2024-07-01 22:08:39 -07:00
Ying Sheng
75b31a2a88 Update run_batch interface and max_prefill_tokens (#574) 2024-06-30 18:26:04 -07:00
sglang
11616fc6bd Minor fix in compiler & format (#545) 2024-06-29 23:42:14 -07:00
Ying Sheng
9ce89bc14b Update benchmark script (#571) 2024-06-28 00:44:22 -07:00
Lianmin Zheng
badf3fa020 Expose dtype argument (#569) 2024-06-27 23:30:39 -07:00
Lianmin Zheng
2e6e62e156 Increase the number of thread limitation for tp worker managers. (#567) 2024-06-26 09:33:45 -07:00
Lianmin Zheng
a385ee27bd Warmup cublas (#566) 2024-06-25 12:46:00 -07:00
Lianmin Zheng
eb1ae6ae0c Add sglang.bench_latency for offline benchmark (#564) 2024-06-25 03:38:04 -07:00
Lianmin Zheng
2187f36237 Add a new arguments log_level_http to control the HTTP logging (#563) 2024-06-25 01:16:20 -07:00
Lianmin Zheng
9465b668b9 Allow running with vllm==0.4.3 (#561) 2024-06-24 15:24:21 -07:00
Lianmin Zheng
1fa15099d8 Add LlamaForClassification (#559) 2024-06-22 00:49:31 -07:00
Lianmin Zheng
303ef8883e Clean up logits processor (#558) 2024-06-22 00:25:24 -07:00
Lianmin Zheng
e94e60d6fb make flashinfer workspace larger 2024-06-21 17:32:36 -07:00
Lianmin Zheng
d2f8bfb2e1 Follow-up fixes for flashinfer 0.0.5 (#556) 2024-06-20 23:19:52 -07:00
Lianmin Zheng
b7e2f800ac Update flashinfer to 0.0.5 (#554) 2024-06-20 20:29:06 -07:00
Ying Sheng
09593e9bc9 Multi-node Tensor Parallelism (#550) 2024-06-17 20:41:24 -07:00
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Lianmin Zheng
53a7ebd89a Update fused_moe (#553) 2024-06-17 09:47:58 -07:00
Liangsheng Yin
ad5f04d6ce Fix the Jump-Forward with Chinese (#551) 2024-06-16 21:45:04 +08:00
Qubitium-modelcloud
bbec01c9aa Fix tp worker only checking req[0] for stream (#546) 2024-06-14 22:56:10 -07:00
Ying Sheng
fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) 2024-06-12 21:48:40 -07:00
Ying Sheng
1374334d38 Fix dependency & crash issues (#539) 2024-06-12 21:23:19 -07:00
Lianmin Zheng
94aead9e8d Fix dependency (#538) 2024-06-12 13:17:35 -07:00
Liangsheng Yin
9c902b1954 Decode Incrementally (#517) 2024-06-11 23:39:12 -07:00
ZhouXingg
111991fe23 Fix Regression: Disable p2p for 4090 (#531) 2024-06-11 23:27:17 -07:00
Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Qubitium
a8c787d2b3 Add ChatGLM Model Support (#516) 2024-06-11 16:39:52 -07:00
Co-authored-by: ZX <zx@lbx.dev>
Fabian Preiß
5f283991e9 [Minor] Correct Optional type hints in api (#526) 2024-06-11 16:37:27 -07:00
Fabian Preiß
542bc733d6 Fix missing numpy dependency in pyproject.toml (#524) 2024-06-10 12:13:50 -07:00
Lianmin Zheng
f6dbd24043 Improve doc strings (#518) 2024-06-08 02:39:32 -07:00
Lianmin Zheng
e8a2327d52 Update version to 0.1.17 (#515) 2024-06-07 19:49:18 -07:00
Lianmin Zheng
91f93f141f Crash the server when error or OOM happens (#514) 2024-06-07 19:22:34 -07:00
Qubitium
f70f72586a Fix rid state map leak + Refractor .finished (#505) 2024-06-07 13:20:40 -07:00
Co-authored-by: ZX <zx@lbx.dev>
Lianmin Zheng
c0ae70c8ed Improve logging & fix litellm dependency. (#512) 2024-06-07 13:10:32 -07:00
胡译文
87260b7bfd Litellm Backend (#502) 2024-06-07 12:24:28 -07:00
Amos You
651a23ee7c remove redundant pad_input_ids function (#500) 2024-06-07 12:23:29 -07:00
Lianmin Zheng
bf3e271fe0 Update vllm to v0.4.3 (#511) 2024-06-07 12:11:31 -07:00
Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: ZX <zx@lbx.dev>
Lianmin Zheng
3bc01ac137 [Minor] improve code style 2024-06-03 18:11:34 -07:00
Lianmin Zheng
159cc741e4 Make the server random by default (#493) 2024-05-31 23:33:34 -07:00
Ying Sheng
83525a1df2 Revert "Make the server random by default" (#492) 2024-05-31 12:00:21 -07:00