Commit Graph

  • d1b31b0684 Improve docs and fix the broken links (#1875) Lianmin Zheng 2024-11-01 17:47:44 -07:00
  • d59a47828c [3rdparty, document] Updated Documentation that covers performance tuning techniques for AMD Instinct GPUs. (#1871) jacky.cheng 2024-11-02 03:12:59 +08:00
  • 104bf2609b minor: update nightly eval (#1867) Yineng Zhang 2024-11-01 21:38:29 +08:00
  • 3bf3d011ed Add vlm document (#1866) Chayenne 2024-11-01 00:51:15 -07:00
  • d86a2d6562 minor: add human eval (#1754) Yineng Zhang 2024-11-01 14:29:20 +08:00
  • 16eb33ffe2 Update vocab embedding deps and add TP switch (#1856) Ke Bao 2024-11-01 11:13:07 +08:00
  • 61cf00e112 change file tree (#1859) Chayenne 2024-10-31 20:10:16 -07:00
  • b9fd178f1b Fix retraction + overlap (#1860) Liangsheng Yin 2024-10-31 18:27:42 -07:00
  • d8e9d61f86 [Build, ROCm] Dockerfile.rocm for Instinct GPUs, with package updates (#1861) HAI 2024-10-31 16:38:16 -07:00
  • a2e0424abf Fix memory leak for chunked prefill 2 (#1858) Lianmin Zheng 2024-10-31 14:51:51 -07:00
  • 8ce202a493 delete unused character (#1855) geeker-smallwhite 2024-10-31 19:33:55 +08:00
  • d913d52c9a Fix warnings in doc build (#1852) Lianmin Zheng 2024-10-30 22:28:00 -07:00
  • 0ab7bcaf66 Simplify documentation in README.md (#1851) Lianmin Zheng 2024-10-30 21:57:49 -07:00
  • 438526a814 Refactor tokenizer manager (#1846) Byron Hsu 2024-10-30 21:32:18 -07:00
  • f7102fbd2b Fix mixed chunked prefill (#1850) Lianmin Zheng 2024-10-30 21:20:41 -07:00
  • a7a0a6886b Make decode log interval configurable (#1847) Byron Hsu 2024-10-30 19:59:20 -07:00
  • 2d4ce1b792 [Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… (#1845) HAI 2024-10-30 17:33:36 -07:00
  • 4ba815b84e Fix suggest edit (#1842) Chayenne 2024-10-30 12:28:12 -07:00
  • 5f65e2b830 [Performance, Hardware] MoE weights padding to AMD MI300x GPUs (#1836) HAI 2024-10-30 12:17:32 -07:00
  • 4e2af03cfa [Production] Drain requests before exit when receive SIGTERM (#1838) Ying Sheng 2024-10-30 10:22:56 -07:00
  • 3184aa95a7 Update README.md (#1840) Lianmin Zheng 2024-10-30 03:16:43 -07:00
  • b548801ddb Update docs (#1839) Lianmin Zheng 2024-10-30 02:49:08 -07:00
  • 539df95d2c Improve openai api documents (#1827) Chayenne 2024-10-30 00:39:41 -07:00
  • 5e00ddebc0 Add new model: Gpt2 (#1833) DanielC12321 2024-10-29 19:52:33 -05:00
  • 54dd3ea122 [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… (#1835) HAI 2024-10-29 13:58:03 -07:00
  • d04899d7ca stop_str of qwen2-vl template should be a tuple not a str (#1834) yizhang2077 2024-10-30 04:30:41 +08:00
  • 5010e0d2ca [3rdparty, document] Add 3rdparty/amd, with profiling and tuning instructions to be added (#1822) HAI 2024-10-29 10:51:02 -07:00
  • 5e6c32657e Support setting use_thread in the run_program for easier debugging. (#1823) Yanyi Liu 2024-10-29 14:51:47 +08:00
  • 680cad2023 fix get_memory_pool_size deadlock for DP (#1830) Byron Hsu 2024-10-28 23:07:14 -07:00
  • 0a24eb850a Fix update_weights deadlock for DP (#1825) Byron Hsu 2024-10-28 12:02:23 -07:00
  • 3839be2913 [Router] Add a rust-based router (#1790) Byron Hsu 2024-10-28 09:49:48 -07:00
  • 6e13b650a9 Fix docs deploy ci (#1821) Chayenne 2024-10-27 21:03:41 -07:00
  • 6fcd6d7d6d Support token ids in engine.generate (#1820) Byron Hsu 2024-10-27 14:02:34 -07:00
  • c77762d57f Fix Triton decode kernel & ut (#1819) Ke Bao 2024-10-28 01:54:38 +08:00
  • 51c81e339b Add openAI compatible API (#1810) Chayenne 2024-10-27 10:51:42 -07:00
  • eaade87a42 Fix unit tests (#1817) Lianmin Zheng 2024-10-27 03:04:54 -07:00
  • 86fc0d79d0 Add a watch dog thread (#1816) Lianmin Zheng 2024-10-27 02:00:50 -07:00
  • 1be853ee69 Update hyperparameter_tuning.md (#1813) Lianmin Zheng 2024-10-26 21:08:44 -07:00
  • 86e0dde555 Improve the user control of new_token_ratio (#1811) Lianmin Zheng 2024-10-26 16:39:41 -07:00
  • 2b80978859 Provide an argument to set the maximum batch size for cuda graph (#1809) Lianmin Zheng 2024-10-26 15:09:33 -07:00
  • 9d6fb08457 Fix docs ci (#1808) Chayenne 2024-10-26 11:23:51 -07:00
  • ced362f7c6 Simplify our docs with complicated functions into utils (#1807) Chayenne 2024-10-26 10:44:11 -07:00
  • 9084a86445 Update links (#1805) Lianmin Zheng 2024-10-26 04:46:01 -07:00
  • 6aa94b967c Update ci workflows (#1804) Lianmin Zheng 2024-10-26 04:32:36 -07:00
  • c26507484f fix int conversion for SGLANG_CPU_COUNT (#1803) Byron Hsu 2024-10-26 00:09:44 -07:00
  • 07bf2e846a Allow consecutive ports when launching multiple sglang servers. (#1802) Liangsheng Yin 2024-10-25 23:43:24 -07:00
  • a628dd8e31 Set ZMQ buffer size heuristic (#1801) Liangsheng Yin 2024-10-25 23:15:56 -07:00
  • 1e8903414a Fix possible ZMQ hanging (#1800) Liangsheng Yin 2024-10-25 23:07:07 -07:00
  • 715b16c140 Add support for ipynb (#1786) Chayenne 2024-10-25 20:48:35 -07:00
  • 9ce8e1a93c move max_position_embeddings to the last (#1799) Hui Liu 2024-10-25 19:30:50 -07:00
  • fb99aaa527 [Fix] Fix --skip-tokenizer-init (#1798) Lianmin Zheng 2024-10-25 18:51:59 -07:00
  • b77a02cdfd [Performance] Support both xgrammar and outlines for constrained decoding (#1752) DarkSharpness 2024-10-26 06:47:02 +09:00
  • 30643fed7f Release v0.3.4.post2 (#1796) Lianmin Zheng 2024-10-25 11:07:19 -07:00
  • e646c5901e Fix logprob in the overlapped mode (#1795) Lianmin Zheng 2024-10-25 11:06:57 -07:00
  • c555ce2ca2 Revert "Fix memory leak when doing chunked prefill" (#1797) Lianmin Zheng 2024-10-25 10:24:44 -07:00
  • 40900baea7 [Fix] Fix the log parsing in chunked prefill unit tests (#1794) Lianmin Zheng 2024-10-25 08:31:08 -07:00
  • a2f5e7555f Fix memory leak when doing chunked prefill (#1787) Liangsheng Yin 2024-10-25 08:01:17 -07:00
  • 2148914e1b Fix log parsing in the chunked prefill unit tests (#1793) Lianmin Zheng 2024-10-25 08:00:55 -07:00
  • def55bc876 Qwen2vl support cuda graph and disable radix cache (#1780) yizhang2077 2024-10-25 22:45:17 +08:00
  • 86a2c473b7 [Fix] Fix seq_lens_sum for cuda graph runner in padded cases (#1789) Lianmin Zheng 2024-10-24 21:26:05 -07:00
  • 1701b0db31 Enhance the test case for chunked prefill (#1785) Lianmin Zheng 2024-10-24 21:23:09 -07:00
  • 384d85ba35 Re-introduce get_cuda_graph_seq_len_fill_value (#1783) Lianmin Zheng 2024-10-24 13:30:11 -07:00
  • 605972195b check user-specified model_max_len with hf derived max_model_len (#1778) Xiaoyu Zhang 2024-10-25 03:40:36 +08:00
  • fc82f5a743 [Fix] Fix cuda graph padding for triton attention backend (#1782) Lianmin Zheng 2024-10-24 12:33:15 -07:00
  • 0089c4bc96 [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer (#1779) Lianmin Zheng 2024-10-24 04:16:59 -07:00
  • 72e7b57a75 [Bug] Catch any errors caused by parsing json schema (#1776) zolinthecow 2024-10-24 01:54:53 -07:00
  • 87a7cfa080 Fix MockTokenizer in the unit tests (#1774) Lianmin Zheng 2024-10-23 17:47:05 -07:00
  • 8f8f96a621 Fix the perf regression due to additional_stop_token_ids (#1773) Lianmin Zheng 2024-10-23 16:45:21 -07:00
  • 05b3bf5e8e Crash the server on warnings in CI (#1772) Lianmin Zheng 2024-10-23 16:27:13 -07:00
  • 3f5ac88d02 Fix out of memory message. (#1771) Liangsheng Yin 2024-10-23 15:20:39 -07:00
  • 0d800090b4 Fix missing additional_stop_token_ids (#1769) Lianmin Zheng 2024-10-23 12:18:59 -07:00
  • b7d0559496 Update docs (#1768) Lianmin Zheng 2024-10-23 11:28:48 -07:00
  • 80a905475d Fix stop condition for <|eom_id|> (#1766) Lianmin Zheng 2024-10-23 10:47:12 -07:00
  • 9af7b88e3c [Fix] Fix abort in dp (#1767) Lianmin Zheng 2024-10-23 10:46:29 -07:00
  • fbcbb26327 Fix perf regression for set_kv_buffer (#1765) Lianmin Zheng 2024-10-23 09:57:08 -07:00
  • 2fce449b1c [API] add get memory pool size (#1760) Ying Sheng 2024-10-23 00:02:29 -07:00
  • ad4125d1a9 Fuse more ops & Simplify token mapping (#1758) Lianmin Zheng 2024-10-22 23:20:43 -07:00
  • 17536e7e3d Fix edge case for truncated (#1747) Byron Hsu 2024-10-22 21:00:25 -07:00
  • 1f26e8b8e4 Release v0.3.4.post1 (#1749) Lianmin Zheng 2024-10-21 21:16:43 -07:00
  • 5e1558f1f2 Update max_req_len and max_req_input_len (#1748) Liangsheng Yin 2024-10-21 16:12:04 -07:00
  • 94cde10920 Llama3.2 vision model support (#1551) Liangsheng Yin 2024-10-21 15:01:21 -07:00
  • 00611286a1 Fix sliding window attention and gemma-2 unit tests in CI (#1746) Lianmin Zheng 2024-10-21 13:47:12 -07:00
  • e68b9e7667 misc: add CODEOWNERS (#1737) Yineng Zhang 2024-10-21 06:28:32 -07:00
  • 7ce3606891 Faster overlap mode scheduler (#1738) Lianmin Zheng 2024-10-21 04:30:52 -07:00
  • efb099cdee Fix prefill oom (#1743) Liangsheng Yin 2024-10-21 03:54:35 -07:00
  • 09603c6dc9 Maintain seq_lens_sum to make more FlashInfer operations non-blocking (#1741) Lianmin Zheng 2024-10-21 01:43:16 -07:00
  • cf470fea32 Make token mapping non-blocking in the overlapped mode (#1740) Lianmin Zheng 2024-10-20 23:25:14 -07:00
  • 45d5af2416 Add GLM-4 TextGeneration Model support for SGLang (#1736) sixgod 2024-10-21 12:08:30 +08:00
  • b121bc03a3 Simplify batch result resolution (#1735) Lianmin Zheng 2024-10-20 19:47:14 -07:00
  • e12358dc91 Simplify the usage of device (#1734) Lianmin Zheng 2024-10-20 18:17:41 -07:00
  • 554fbf93cd [Bugfix] qwen2vl forward_extend (#1727) yizhang2077 2024-10-20 17:38:35 +08:00
  • b48edff67f Split the overlapped version of TpModelWorkerClient into a separate file (#1726) Lianmin Zheng 2024-10-20 00:29:29 -07:00
  • 593b19f29d Temporarily skip this test_mixed_batch for QWen2VL (#1725) Lianmin Zheng 2024-10-20 00:05:45 -07:00
  • 59cbf47626 Unify the memory pool api and tp worker API (#1724) Lianmin Zheng 2024-10-19 23:19:26 -07:00
  • 95946271af Update README.md Ying Sheng 2024-10-19 22:29:12 -07:00
  • 5c4ce65631 Update README.md (#1722) Ying Sheng 2024-10-19 22:27:38 -07:00
  • cbbc82b7b8 Support qwen2 vl model (#1721) Yineng Zhang 2024-10-19 21:44:38 -07:00
  • 8bee20f80b Update vllm to 0.6.3 (#1711) (#1720) Yineng Zhang 2024-10-19 20:45:41 -07:00
  • 12cad0feae Simplify the interface of tp_worker (#1718) Lianmin Zheng 2024-10-19 17:39:38 -07:00
  • b6cd903604 Update readme and workflow (#1716) Lianmin Zheng 2024-10-19 12:58:55 -07:00
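
A one-line-per-commit listing like the above can be reproduced from a checkout of the repository with a custom `git log` format. This is a sketch, not the exact command used to generate this graph: the field order (abbreviated hash, subject, author, author date) is read off the listing, the 10-character hash width is assumed, and git's ISO date output writes the UTC offset without a colon (e.g. `-0700` rather than `-07:00`).

```shell
# List commits as: <10-char hash> <subject> <author name> <author date>,
# matching the layout of the graph above (offset formatting differs slightly).
git log --pretty=format:'%h %s %an %ad' --date=iso --abbrev=10
```

The PR numbers such as `(#1875)` are not a separate field; GitHub appends them to the merge commit's subject line, so `%s` carries them automatically.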