Commit Graph

  • 321a963b01 misc: update doc (#715) Yineng Zhang 2024-07-25 06:05:46 +10:00
  • e17deb27b5 fix: llama 3.1 405b fp8 (#714) Yineng Zhang 2024-07-25 02:37:41 +10:00
  • 2d3ae4e125 docs: update doc (#713) Yineng Zhang 2024-07-25 00:03:17 +10:00
  • 75f4ccb7dd docs: update README (#712) Yineng Zhang 2024-07-24 23:33:28 +10:00
  • 83d2b30d75 format Ying Sheng 2024-07-24 10:53:07 +00:00
  • 4367f4bb8d Fix prefill size (#711) Ying Sheng 2024-07-24 03:41:15 -07:00
  • 00e4baa728 Update schedule_heuristic.py Lianmin Zheng 2024-07-24 01:22:30 -07:00
  • 4cd64b8ee6 Auto adjust new ratio (#708) Liangsheng Yin 2024-07-23 22:06:02 -07:00
  • 01d66ae2e8 Fix multi-node deadlock (#709) Lianmin Zheng 2024-07-23 21:53:36 -07:00
  • a523a3c13a Reduce hardcoded logic of kernel usage (#707) Mingyi 2024-07-23 16:42:21 -07:00
  • 9f94728f5a bump version to 0.1.23 (#706) Ying Sheng 2024-07-23 13:53:19 -07:00
  • 444a02441a Update vllm version to support llama3.1 (#705) Ying Sheng 2024-07-23 13:49:34 -07:00
  • fa7ccb3316 feat: add e2e latency (#704) zhyncs 2024-07-24 05:51:10 +10:00
  • 268684439b Use min new token ratio at start (#701) Liangsheng Yin 2024-07-23 11:52:50 -07:00
  • 824a77d04d Fix hf config loading (#702) Ke Bao 2024-07-24 02:39:08 +08:00
  • cf99eab7d5 Fix flashinfer (#700) Ying Sheng 2024-07-23 01:27:01 -07:00
  • 9fdea29d05 misc: fix typo (#698) zhyncs 2024-07-23 02:00:27 +10:00
  • df7c4c19b4 Fix trt benchmark (#697) Ying Sheng 2024-07-22 06:32:41 -07:00
  • c3f1aac811 Tune params (#696) Ying Sheng 2024-07-22 03:19:24 -07:00
  • d198791fe8 misc: update output token logic (#695) zhyncs 2024-07-22 19:34:05 +10:00
  • c07526e46c fix: update bench serving (#694) zhyncs 2024-07-22 18:23:33 +10:00
  • 7b597475f2 docs: update README (#692) zhyncs 2024-07-22 03:41:20 +10:00
  • 5303c1ed22 Support Mistral-Nemo (#691) Ke Bao 2024-07-22 01:36:53 +08:00
  • 65bd13386b misc: recommend to use chat model for benchmark (#690) zhyncs 2024-07-22 00:13:33 +10:00
  • eedc12e12e Support Deepseek MoE Model (#689) Liangsheng Yin 2024-07-21 03:09:29 -07:00
  • 5a4ef2b5c8 update readme Lianmin Zheng 2024-07-21 02:58:57 -07:00
  • 9dab947d56 docs: update README (#688) zhyncs 2024-07-21 18:32:58 +10:00
  • 33ee97b0bf Allow disabling streaming in bench (#687) Lianmin Zheng 2024-07-21 01:12:34 -07:00
  • 6a846bb1fd misc: update output file logic (#686) zhyncs 2024-07-21 18:07:30 +10:00
  • 0fdb3127a1 feat: update bench serving (#685) zhyncs 2024-07-21 16:46:58 +10:00
  • 5ad033a070 Fix StreamExecutor.fork() losing the current role start index. (#684) Max Shawabkeh 2024-07-20 23:32:11 -07:00
  • 77e592e8e0 support non-streaming benchmark (#682) Lianmin Zheng 2024-07-20 18:36:42 -07:00
  • caaad53b52 Support gpt-bigcode model class (#681) Liangsheng Yin 2024-07-20 18:34:37 -07:00
  • 69d19188fc Decouple kv (#679) Liangsheng Yin 2024-07-20 14:16:45 -07:00
  • 4b4a67f814 feat: support TRT LLM benchmark and multiple benchmarks (#670) zhyncs 2024-07-21 04:05:35 +10:00
  • 0ac94c36cb Fallback when sampling failed (#678) Ke Bao 2024-07-21 01:44:54 +08:00
  • 2b4c646277 Update version to 0.1.22 (#677) Ying Sheng 2024-07-20 03:39:50 -07:00
  • f424e76d96 Fix illegal tokens during sampling (#676) Liangsheng Yin 2024-07-20 03:11:15 -07:00
  • 490a1f39dd Fix cuda graph with flashinfer (#675) Lianmin Zheng 2024-07-20 02:43:55 -07:00
  • 06487f126e refactor model loader: initial refactor (#664) Ying Sheng 2024-07-20 02:18:22 -07:00
  • 39c57317e1 Revert "Temporary fix invalid sample results" (#673) Liangsheng Yin 2024-07-20 02:06:31 -07:00
  • 9592a1f3bd Fix random dataset (#671) Lianmin Zheng 2024-07-20 01:57:43 -07:00
  • 35759efa91 Support random dataset in bench_serving.py (#669) Lianmin Zheng 2024-07-20 01:06:43 -07:00
  • 8f4b1559e7 Temporary fix invalid sample results (#668) Liangsheng Yin 2024-07-20 00:51:05 -07:00
  • e3046ea3a8 Update OpenAI API (#667) Mingyi 2024-07-19 23:20:54 -07:00
  • 49c5e0eca9 Add support for OpenAI API parallel sampling (#640) yichuan~ 2024-07-20 14:10:01 +08:00
  • ec2150b294 Fix kill process util (#666) Ke Bao 2024-07-20 12:43:11 +08:00
  • 7620cd37dd Fix jump forward when streaming (#665) Liangsheng Yin 2024-07-19 16:42:06 -07:00
  • 50a53887be Update docs Ying Sheng 2024-07-19 11:40:06 -07:00
  • 11c8efff73 Add benchmark instructions (#663) Ying Sheng 2024-07-19 11:12:23 -07:00
  • e87c7fd501 Improve docs (#662) Ying Sheng 2024-07-19 10:58:03 -07:00
  • 630479c3a6 feat: update check env (#661) zhyncs 2024-07-20 02:54:15 +10:00
  • 51fda1439f Update Readme (#660) Ying Sheng 2024-07-19 09:54:01 -07:00
  • dc4e4a6acc misc: update SGLang package description (#659) zhyncs 2024-07-20 02:27:39 +10:00
  • 2d96da813e refactor model loader [unreachable code]: initial refactor (#655) Ying Sheng 2024-07-19 09:27:06 -07:00
  • c126a6ccba feat: add benchmark serving (#657) zhyncs 2024-07-20 02:15:21 +10:00
  • ac971ff633 perf: reduce ttft and itl with stream_interval 1 (#658) zhyncs 2024-07-20 02:14:22 +10:00
  • e1792cca24 Remove cached triton launcher (#656) Lianmin Zheng 2024-07-18 23:28:40 -07:00
  • 1b7adbb5a0 TokenizerManager.context_len should inherit from `server_args.conte… (#654) shrirajh 2024-07-19 14:25:29 +09:30
  • a9ef49c12c Detokenize incrementally when streaming (#653) Liangsheng Yin 2024-07-18 17:57:40 -07:00
  • 21ba3a88a1 Remove useless variables in infer_batch.py (#651) Ying Sheng 2024-07-18 05:31:44 -07:00
  • 9c5cac2450 fix: resolve lint error (#650) zhyncs 2024-07-18 20:33:21 +10:00
  • 5960a6e505 feat: add lint workflow (#648) zhyncs 2024-07-18 20:04:30 +10:00
  • b050d9283f fix: set ulimit -n 65535 (#647) zhyncs 2024-07-18 19:35:45 +10:00
  • 6a4dc99697 misc: rm rpyc from PACKAGE_LIST (#649) zhyncs 2024-07-18 19:35:38 +10:00
  • d774acad5c Remove the dependency of rpyc (#646) Mingyi 2024-07-18 02:13:54 -07:00
  • d93388da3e feat: add check_env (#645) zhyncs 2024-07-18 14:39:28 +10:00
  • 476584cb6e Increase the capacity of the memory pool (#643) Ying Sheng 2024-07-17 15:44:41 -07:00
  • abd5385ac5 Move global_server_args_dict (#642) Liangsheng Yin 2024-07-17 13:49:15 -07:00
  • 3de2f30a27 Flashinfer sample kernel (#617) Liangsheng Yin 2024-07-17 13:24:43 -07:00
  • 4efcc59d4f misc: add issue and pr template (#638) zhyncs 2024-07-18 04:58:11 +10:00
  • 2e341cd493 misc: add pre-commit config (#637) zhyncs 2024-07-18 04:55:39 +10:00
  • a8552cb18b feat: support internlm2 (#636) zhyncs 2024-07-17 15:40:03 +10:00
  • a470e60c97 clean up step function (#635) Ying Sheng 2024-07-16 20:15:24 -07:00
  • 5f90e0769c Update README.md Ying Sheng 2024-07-16 19:18:54 -07:00
  • 8832ecb1e4 Reduce docker size (#632) Liangsheng Yin 2024-07-16 16:12:12 -07:00
  • 5ff60eda78 Fix vertexai (#633) Liangsheng Yin 2024-07-16 16:07:19 -07:00
  • c193002297 Add support for VertexAI safety settings (#624) Aidan Cooper 2024-07-16 19:54:42 +01:00
  • fe3be1595d Add qwen2 tie word embedding (#630) ylying 2024-07-17 02:48:49 +08:00
  • 0aa189f150 Disable NCCL_NVLS by default (#631) Ying Sheng 2024-07-16 09:05:10 -07:00
  • f6b29f6920 Update docker file (#629) Ying Sheng 2024-07-16 01:12:37 -07:00
  • c9ee3d3559 Fix model forward grad (#628) Liangsheng Yin 2024-07-15 22:09:09 -07:00
  • 41d1f67704 Fix flush cache (#627) Lianmin Zheng 2024-07-15 19:56:55 -07:00
  • 56f5fc4ab5 Bump version to 0.1.21 (#626) Ying Sheng 2024-07-15 13:10:53 -07:00
  • 6a2941f4d0 Improve tensor parallel performance (#625) Ying Sheng 2024-07-15 07:10:51 -07:00
  • 5ac8b80677 Simplify mem state (#623) Mingyi 2024-07-15 02:01:09 -07:00
  • bae9541e4c Update benchmark script (#621) Ying Sheng 2024-07-14 14:38:13 -07:00
  • a56858ba67 Unify index operations (#620) Liangsheng Yin 2024-07-14 12:55:55 -07:00
  • 564a898ad9 Optimize mem indices management (#619) Liangsheng Yin 2024-07-13 23:39:37 -07:00
  • 5d264a90ac Bump version to 0.1.20 (#618) Lianmin Zheng 2024-07-13 17:27:55 -07:00
  • 5949b1ca0e Fix memory pool index error (#616) Ying Sheng 2024-07-13 16:45:11 -07:00
  • 0feca02dd9 Improve benchmark scripts (#615) Lianmin Zheng 2024-07-13 15:59:04 -07:00
  • 10143e1a5f Memorypool chunked prefetch (#614) Liangsheng Yin 2024-07-13 15:24:03 -07:00
  • 65c6577696 Improve benchmark scripts & fix llava (#613) Lianmin Zheng 2024-07-13 15:00:26 -07:00
  • 665815969a Enable cuda graph by default (#612) Lianmin Zheng 2024-07-13 05:29:46 -07:00
  • 396a69240f Cleanup attention backend: flashinfer and triton (#611) Lianmin Zheng 2024-07-12 18:21:11 -07:00
  • af4e7910e7 Clean up the usage of flashinfer (#610) Lianmin Zheng 2024-07-12 13:00:03 -07:00
  • 519e20cfda Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py (#609) Lianmin Zheng 2024-07-12 12:28:09 -07:00
  • d9a6902986 Fix bench latency (#607) Lianmin Zheng 2024-07-11 14:37:01 -07:00
  • ad872feb14 bump version to 0.1.19 Lianmin Zheng 2024-07-09 02:23:14 -07:00
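
A listing in this shape (abbreviated hash, subject line, author, date with UTC offset) can be regenerated from any clone of the repository using `git log` format placeholders. The command below is a minimal sketch, not necessarily the exact tool that produced the graph above:

```shell
# Run from inside a clone of the repository.
# %h = abbreviated hash, %s = subject, %an = author name,
# %ad = author date (ISO 8601 with UTC offset via --date=iso-strict).
git log --pretty=format:'%h %s %an %ad' --date=iso-strict
```

Limiting the range (e.g. `git log v0.1.19..v0.1.23 --pretty=...`) reproduces just the commits between two releases.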