Commit Graph

  • da2e5d6546 Fix the default argument of OpenAI Chat completion (#605) Lianmin Zheng 2024-07-09 01:52:55 -07:00
  • ce62dc73f0 Update model_support.md Lianmin Zheng 2024-07-09 01:32:46 -07:00
  • 02b7258658 [Feat] Expose logprob options to sgl.gen API (#503) 胡译文 2024-07-09 15:35:39 +08:00
  • d557e9f3b7 Update chat template for qwen and yi-1.5. (#530) prophe 2024-07-09 14:55:44 +08:00
  • 740c46a152 Add Qwen2 MoE support (#603) Tommy Yang 2024-07-09 14:44:59 +08:00
  • b38687226a Make sglang compat with vllm 0.5.1 (#598) Tommy Yang 2024-07-09 14:44:22 +08:00
  • 710f614ebe add minicpm support (#602) Pan Lyu 2024-07-09 14:27:04 +08:00
  • f25b76c02a add LogitsMetadata (#604) Liangsheng Yin 2024-07-08 17:46:55 -07:00
  • f4e885b7c3 Reduce number of workspaces (#601) Mingyi 2024-07-07 19:35:22 -07:00
  • 0877f1e75b Fix streaming (#600) Liangsheng Yin 2024-07-07 01:55:58 -07:00
  • 5304b4ef58 Add --enable-p2p-check option (#599) Liangsheng Yin 2024-07-06 23:34:10 -07:00
  • 26908d9568 fix(detokenizer_manager.py): fix truncated decoded output (#586) Pan Lyu 2024-07-07 05:53:22 +08:00
  • c0982ac553 Fix Llava model (#594) Mingyi 2024-07-06 00:58:46 -07:00
  • dc1b8bcfaa Format (#593) Ying Sheng 2024-07-05 10:06:17 -07:00
  • 5a57b8addd Add Gemma2 (#592) Ying Sheng 2024-07-05 09:48:54 -07:00
  • d737da5f17 Update README.md Lianmin Zheng 2024-07-04 00:55:40 -07:00
  • ac11388756 Add docker file (#588) Ying Sheng 2024-07-04 00:53:49 -07:00
  • dc8cef1d0c Update README.md Lianmin Zheng 2024-07-04 00:05:40 -07:00
  • 2f11936f95 bump version to 0.1.18 Ying Sheng 2024-07-04 06:27:29 +00:00
  • 63fbef9876 fix flashinfer & http log level Lianmin Zheng 2024-07-03 23:19:33 -07:00
  • 2a754e57b0 2x performance improvement for large prefill & Fix workspace conflicts (#579) Ying Sheng 2024-07-03 16:14:57 -07:00
  • 96c503eb60 fix the broken server args (#585) Liangsheng Yin 2024-07-04 07:01:19 +08:00
  • 441cca773d support gptj style rope in llama Xuechen Li 2024-07-03 12:23:30 -07:00
  • c7709d3abe Update install commands (#583) Lianmin Zheng 2024-07-03 02:07:34 -07:00
  • 9380f50ff9 Turn on flashinfer by default (#578) Ying Sheng 2024-07-02 02:25:07 -07:00
  • 95dc093b19 [BugFix] gemma loading weights "lm_head.weight" key error (#577) Daniel Hernandez Garcia 2024-07-02 06:10:07 +01:00
  • d9ac639202 Fix flashinfer version (#576) Yueyang Pan 2024-07-02 07:08:39 +02:00
  • 26294b2f3d Update README.md Lianmin Zheng 2024-07-01 09:54:08 -07:00
  • 75b31a2a88 Update run_batch interface and max_prefill_tokens (#574) Ying Sheng 2024-06-30 18:26:04 -07:00
  • 11616fc6bd Minor fix in compiler & format (#545) sglang 2024-06-29 23:42:14 -07:00
  • 9ce89bc14b Update benchmark script (#571) Ying Sheng 2024-06-28 00:44:22 -07:00
  • badf3fa020 Expose dtype argument (#569) Lianmin Zheng 2024-06-27 23:30:39 -07:00
  • 945aa9beb2 Update readme (#568) Lianmin Zheng 2024-06-27 11:37:49 -07:00
  • 2e6e62e156 Increase the number of thread limitation for tp worker managers. (#567) Lianmin Zheng 2024-06-26 09:33:45 -07:00
  • a385ee27bd Warmup cublas (#566) Lianmin Zheng 2024-06-25 12:46:00 -07:00
  • eb1ae6ae0c Add sglang.bench_latency for offline benchmark (#564) Lianmin Zheng 2024-06-25 03:38:04 -07:00
  • 2187f36237 Add a new arguments log_level_http to control the HTTP logging (#563) Lianmin Zheng 2024-06-25 01:16:20 -07:00
  • 9465b668b9 Allow running with vllm==0.4.3 (#561) Lianmin Zheng 2024-06-24 15:24:21 -07:00
  • 05471f2103 Update test_flashinfer (#560) Liangsheng Yin 2024-06-24 15:23:57 +08:00
  • 1fa15099d8 Add LlamaForClassification (#559) Lianmin Zheng 2024-06-22 00:45:33 -07:00
  • 303ef8883e Clean up logits processor (#558) Lianmin Zheng 2024-06-22 00:25:24 -07:00
  • 92cb93f390 Fix latency benchmark (#557) Liangsheng Yin 2024-06-22 15:11:04 +08:00
  • e94e60d6fb make flashinfer workspace larger Lianmin Zheng 2024-06-21 17:32:36 -07:00
  • d2f8bfb2e1 Follow-up fixes for flashinfer 0.0.5 (#556) Lianmin Zheng 2024-06-20 23:19:52 -07:00
  • b7e2f800ac Update flashinfer to 0.0.5 (#554) Lianmin Zheng 2024-06-20 20:29:06 -07:00
  • 09593e9bc9 Multi-node Tensor Parallelism (#550) Ying Sheng 2024-06-17 20:41:24 -07:00
  • 53a7ebd89a Update fused_moe (#553) Lianmin Zheng 2024-06-17 09:47:58 -07:00
  • ad5f04d6ce Fix the Jump-Forward with Chinese (#551) Liangsheng Yin 2024-06-16 21:45:04 +08:00
  • bbec01c9aa Fix tp worker only checking req[0] for stream (#546) Qubitium-modelcloud 2024-06-15 13:56:10 +08:00
  • 40e53d65cb Add disk cache for loading ShareGPT dataset. (#542) Liangsheng Yin 2024-06-13 16:37:12 +08:00
  • fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) Ying Sheng 2024-06-12 21:48:40 -07:00
  • 1374334d38 Fix dependency & crash issues (#539) Ying Sheng 2024-06-12 21:23:19 -07:00
  • 94aead9e8d Fix dependency (#538) Lianmin Zheng 2024-06-12 13:17:35 -07:00
  • 9c902b1954 Decode Incrementally (#517) Liangsheng Yin 2024-06-12 14:39:12 +08:00
  • 111991fe23 Fix Regression: Disable p2p for 4090 (#531) ZhouXingg 2024-06-12 14:27:17 +08:00
  • a8c787d2b3 Add ChatGLM Model Support (#516) Qubitium 2024-06-12 07:39:52 +08:00
  • 5f283991e9 [Minor] Correct Optional type hints in api (#526) Fabian Preiß 2024-06-12 01:37:27 +02:00
  • b6667a53b9 Fix RAG nb, parea setup (parea -> parea-ai) (#525) Fabian Preiß 2024-06-12 01:36:43 +02:00
  • 542bc733d6 Fix missing numpy dependency in pyproject.toml (#524) Fabian Preiß 2024-06-10 21:13:50 +02:00
  • f6dbd24043 Improve doc strings (#518) Lianmin Zheng 2024-06-08 02:06:52 -07:00
  • e8a2327d52 Update version to 0.1.17 (#515) Lianmin Zheng 2024-06-07 19:49:18 -07:00
  • 91f93f141f Crash the server when error or OOM happens (#514) Lianmin Zheng 2024-06-07 19:22:34 -07:00
  • f70f72586a Fix rid state map leak + Refractor .finished (#505) Qubitium 2024-06-08 04:20:40 +08:00
  • c0ae70c8ed Improve logging & fix litellm dependency. (#512) Lianmin Zheng 2024-06-07 12:51:40 -07:00
  • 87260b7bfd Litellm Backend (#502) 胡译文 2024-06-08 03:24:28 +08:00
  • 651a23ee7c remove redundant pad_input_ids function (#500) Amos You 2024-06-07 12:23:29 -07:00
  • bf3e271fe0 Update vllm to v0.4.3 (#511) Lianmin Zheng 2024-06-07 12:11:31 -07:00
  • 3bc01ac137 [Minor] improve code style Lianmin Zheng 2024-06-03 18:11:34 -07:00
  • 9f009261f2 Improve docs Lianmin Zheng 2024-06-01 17:46:08 -05:00
  • 159cc741e4 Make the server random by default (#493) Lianmin Zheng 2024-05-31 23:33:34 -07:00
  • 7d1ebc2d71 update the script: examples/usage/llava_video/srt_example_llava_v.sh (#491) Yuanhan Zhang 2024-06-01 14:31:56 +08:00
  • 83525a1df2 Revert "Make the server random by default" (#492) Ying Sheng 2024-05-31 12:00:21 -07:00
  • 80a33ce8b0 Do not set the default value of global random seed (#488) Lianmin Zheng 2024-05-29 18:41:18 -04:00
  • 1a57e41679 do not launch workers in parallel Lianmin Zheng 2024-05-27 23:00:16 -07:00
  • adc974268a Update docs (#486) Lianmin Zheng 2024-05-27 22:46:04 -07:00
  • 0463f7fb52 Support data parallelism (static) (#480) Ying Sheng 2024-05-27 21:24:10 -07:00
  • 565d727409 improve logging & fix vllm version Lianmin Zheng 2024-05-27 14:32:05 -07:00
  • 09de730dee Improve benchmark scripts & add more models (#484) Lianmin Zheng 2024-05-27 14:13:26 -07:00
  • 55c1643627 Improve benchmark scripts & rename some scripts (#477) Lianmin Zheng 2024-05-26 12:51:45 -07:00
  • 2b605ab1d7 [Feat/Fix] Refactoring Llava models into single file (#475) Li Bo 2024-05-27 03:29:51 +08:00
  • 947bda73fe Add benchmark scripts (#476) Ying Sheng 2024-05-26 12:09:03 -07:00
  • f06e90c2cf Optimize retract (#440) Liangsheng Yin 2024-05-26 00:07:26 +08:00
  • 2cea6146d8 Improve logging & add logit cap (#471) Lianmin Zheng 2024-05-24 03:48:53 -07:00
  • 44c998fcb5 Add the instruction link to the LLaVA-NeXT-Video at README (#463) Yuanhan Zhang 2024-05-24 18:38:20 +08:00
  • 3167d8dabc fix test bug in srt_llava_next_test.py (#470) bing 2024-05-24 18:38:01 +08:00
  • 0fafc5606b port fp8 mixtral (#460) Lianmin Zheng 2024-05-21 11:46:35 -07:00
  • 19d2135cb8 Use model loader from vllm (#459) Lianmin Zheng 2024-05-21 09:13:37 -07:00
  • ced77c6626 Rename api_num_spec_tokens -> num_api_spec_tokens (#458) Lianmin Zheng 2024-05-20 18:44:23 -07:00
  • 8dbdc018a3 Abort disconnected requests (#457) Lianmin Zheng 2024-05-20 18:41:21 -07:00
  • 3e684be7a3 Fix openai speculative execution (#456) Ying Sheng 2024-05-20 17:01:13 -07:00
  • ec380dfd30 openai chat speculative execution (#250) LiviaSun 2024-05-18 22:23:53 -07:00
  • 5b647543c1 Fix the broken --disable-radix-cache (#451) Liangsheng Yin 2024-05-19 13:00:12 +08:00
  • 8210ec60f4 Improve error handling & abort disconnected requests (#449) Lianmin Zheng 2024-05-17 05:49:31 -07:00
  • 5be9eb8a8c Add PUT for generate api (#448) Ying Sheng 2024-05-17 02:35:15 -07:00
  • c05956e534 Simplify port allocation (#447) Lianmin Zheng 2024-05-16 18:07:30 -07:00
  • d75dc20fae Add finish_reason to OpenAI API (#446) Matthias Gerstgrasser 2024-05-16 14:55:05 -07:00
  • 690d162d97 Format code (#441) Liangsheng Yin 2024-05-14 22:40:46 +08:00
  • 664287b2a7 [Feat] Add llava qwen, llava mistral (#419) Kaichen Zhang - NTU 2024-05-14 13:17:50 +08:00
  • e0ae5d42ec Update version to 0.1.16 (#438) Lianmin Zheng 2024-05-13 17:29:17 -07:00
  • 32de16ce2f Fix streaming (#437) Lianmin Zheng 2024-05-13 17:26:18 -07:00