Commit Graph

  • 6cb3974e77 optimize custom allreduce kernel (#2904) yizhang2077 2025-01-16 03:04:25 +08:00
  • f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) Lianmin Zheng 2025-01-15 04:27:18 -08:00
  • b803b395b7 Disable graceful shutdown of tokenizer manager when not in the main thread (#2872) Cody Yu 2025-01-15 03:29:33 -08:00
  • bfbda62c8b Add ut for w8a8 int8 quantization (#2897) Ke Bao 2025-01-15 18:29:14 +08:00
  • b3e99dfb22 chore: bump v0.4.1.post6 (#2899) Yineng Zhang 2025-01-15 16:23:42 +08:00
  • f005758f2b introduce CUB in sgl-kernel (#2887) Xiaoyu Zhang 2025-01-14 19:48:59 +08:00
  • f5c6c66794 feat: support internlm 3 dense (#2888) Yineng Zhang 2025-01-14 19:23:26 +08:00
  • cc0485bef2 Support w8a8 int8 quantization config (#2881) Ke Bao 2025-01-14 17:07:49 +08:00
  • b8cd09f27a update ROCm docker for layernorm kernel optimization (#2885) kk 2025-01-14 16:59:43 +08:00
  • c19d84829c Adjust flashinfer workspace size for Qwen2 models (#2879) Ke Bao 2025-01-14 13:34:22 +08:00
  • 80002562a8 docs: update README (#2878) Yineng Zhang 2025-01-14 12:48:17 +08:00
  • 46d4431889 Add a new api configure_logging to allow dumping the requests (#2875) Lianmin Zheng 2025-01-13 14:24:00 -08:00
  • 923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) fzyzcjy 2025-01-14 03:38:51 +08:00
  • d08c77c434 Sampling penalties memory interface (#2870) Xiaoyu Zhang 2025-01-13 23:09:00 +08:00
  • c1e097ca66 Revert "Dump requests to a folder" (#2869) Lianmin Zheng 2025-01-13 06:21:25 -08:00
  • 6ec75e626d add qwen2 eagle model (#2863) Lzhang-hub 2025-01-13 21:29:33 +08:00
  • d855653bd4 minor: fix release docs (#2868) Yineng Zhang 2025-01-13 21:18:39 +08:00
  • 336ff5b9f5 Fix typos in io_struct.py (#2867) Lianmin Zheng 2025-01-13 05:13:02 -08:00
  • 3b141e1509 Dump requests (#2862) Lianmin Zheng 2025-01-13 04:51:56 -08:00
  • 6249e4a19e Revert "Integration of TurboMind AWQ" (#2866) Lianmin Zheng 2025-01-13 04:44:39 -08:00
  • f3516c2894 Fix quant kernel accuracy issue (#2865) Ke Bao 2025-01-13 20:32:17 +08:00
  • 17de02f98d Integration of TurboMind AWQ (#2828) bjmsong 2025-01-13 20:14:16 +08:00
  • 51ab3ccf47 Collect more metrics: num_requests_total (#2859) Lianmin Zheng 2025-01-13 03:57:39 -08:00
  • 67008f4b32 Use only one GPU for MLA CI tests (#2858) Lianmin Zheng 2025-01-13 03:55:33 -08:00
  • 4536d72446 minor: use ubuntu-latest instead of self-hosted runner for amd build (#2861) Yineng Zhang 2025-01-13 18:58:56 +08:00
  • 41d7e5b7e6 docs: update link (#2857) Yineng Zhang 2025-01-13 18:40:48 +08:00
  • 20a9f5dfe0 fix: not delete CNAME (#2860) Yineng Zhang 2025-01-13 18:36:40 +08:00
  • 42f3909963 Unify sglang coding style (#2856) kk 2025-01-13 18:12:44 +08:00
  • 72c7776355 Fix linear.py and improve weight loading (#2851) Lianmin Zheng 2025-01-13 01:39:14 -08:00
  • 4093aa4660 [Fix]eagle2 health_generate is first request,apiserver will core (#2853) justdoit 2025-01-13 17:01:21 +08:00
  • e808c1df3e Integrate ROCm ater package for ck moe function feasibility (#2854) kk 2025-01-13 16:23:07 +08:00
  • a18ab81ddd Update base image for ROCm (#2852) sogalin 2025-01-13 14:39:44 +08:00
  • 0bb0f76311 Support FP8 E4M3 KV Cache (#2786) bjmsong 2025-01-13 13:17:11 +08:00
  • 85b2e05770 Add int8 quant kernel (#2848) Ke Bao 2025-01-13 13:16:58 +08:00
  • a879c2fb4c fix sgl-kernel build (#2850) Yineng Zhang 2025-01-13 12:27:17 +08:00
  • e2b16c4716 add sampling_scaling_penalties kernel (#2846) Xiaoyu Zhang 2025-01-13 11:38:17 +08:00
  • c4f9707e16 Improve: Token-In Token-Out Usage for RLHF (#2843) Shi Shuai 2025-01-11 23:14:26 +00:00
  • 197cbf9bab docs: update README (#2841) Yineng Zhang 2025-01-11 23:11:38 +08:00
  • f624901cdd chore: bump v0.4.1.post5 (#2840) Yineng Zhang 2025-01-11 23:10:02 +08:00
  • f0e15dc6ab [HotFix] fix fp8 scale load failed in tp>1 (#2837) Xiaoyu Zhang 2025-01-11 14:34:26 +08:00
  • f1769586d6 Update threshold in test_nightly_gsm8k_eval.py (#2836) Lianmin Zheng 2025-01-10 20:37:34 -08:00
  • 5d6e9467d4 Cache controller for hierarchical caching (#2804) Zhiqiang Xie 2025-01-10 20:22:01 -08:00
  • a47bf39123 [Eagle2] Fix multiple concurrent request crashes (#2730) justdoit 2025-01-11 06:00:43 +08:00
  • b170646991 Fix port number overflow (#2826) TianYu GUO 2025-01-11 05:44:32 +08:00
  • 5413ec2bbe [Bugfix] Fix bug in fork logic caused by null text_ (#2835) Muqi Li 2025-01-11 05:37:00 +08:00
  • f290bd4332 [Bugfix] Fix embedding model hangs with --enable-metrics (#2822) Chang Su 2025-01-10 13:14:51 -08:00
  • 8f15789314 Add more metrics to serving benchmark. (#2819) Pratyush Patel 2025-01-10 07:30:44 -08:00
  • 2db03a04ca Update README.md (#2833) Lianmin Zheng 2025-01-10 03:49:04 -08:00
  • 5cc1170552 Doc: add block-wise FP8 in dpsk model reference (#2830) Chayenne 2025-01-10 00:26:59 -08:00
  • 11fffbc95a [Doc]: Deepseek reference docs (#2787) Xiaotong Jiang 2025-01-09 13:43:12 -08:00
  • 4f077c01b8 minor: support specifying local dataset path for gsm8k and hellaswag (#2816) sleepcoo 2025-01-09 22:24:42 +08:00
  • 679c3bcacf Fix typo in cuda_graph_bs (#2813) Lianmin Zheng 2025-01-09 03:03:24 -08:00
  • 656aed58c6 Remove vllm dependency in model config (#2809) Yunmeng 2025-01-09 17:51:56 +08:00
  • b5fb4ef58a Update modelopt config and fix running issue (#2792) Ke Bao 2025-01-08 18:04:30 +08:00
  • 2e6346fc2e Docs:Update the style of llma 3.1 405B docs (#2789) Chayenne 2025-01-08 01:07:54 -08:00
  • 977f785dad Docs: Rewrite docs for LLama 405B and ModelSpace (#2773) mlmz 2025-01-08 16:02:59 +08:00
  • 8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784) Lianmin Zheng 2025-01-07 23:29:10 -08:00
  • 694e41925e [eagle2] fix end check when target model verify (#2723) JJJJOHNSON 2025-01-08 13:46:02 +08:00
  • b22f3f6475 Fix nightly accuracy tests (#2780) Lianmin Zheng 2025-01-07 21:02:35 -08:00
  • 6fb5768372 Disable math eval on nightly CI temporarily (#2779) Lianmin Zheng 2025-01-07 18:17:34 -08:00
  • 51caee740f Host memory pool for hierarchical caching (#2771) Zhiqiang Xie 2025-01-07 13:38:37 -08:00
  • 58f9060efe Update int8 gemm config (#2774) Ke Bao 2025-01-07 19:47:37 +08:00
  • bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) Lianmin Zheng 2025-01-07 02:52:53 -08:00
  • 6d08ce2aa9 Use Optional with None default (#2770) HAI 2025-01-07 01:35:08 -08:00
  • 380930a959 add benchmark_moe_align_blocks (#2767) Xiaoyu Zhang 2025-01-07 14:20:50 +08:00
  • 9dec582dab Remove --modelopt-config in server_args (#2758) Lianmin Zheng 2025-01-06 16:35:45 -08:00
  • b01febdca0 Update README.md (#2757) Lianmin Zheng 2025-01-06 15:36:23 -08:00
  • 1acbaf1b5a Add generator-style run_batch function (#2513) Xingyao Wang 2025-01-06 18:04:55 -05:00
  • 287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) Zhiyu 2025-01-06 14:54:52 -08:00
  • b8574f6953 Clean up eagle code (#2756) Lianmin Zheng 2025-01-06 14:54:18 -08:00
  • 2855caa481 feat: add devcontainer.json for VSCode development (#2745) 王博伟 2025-01-07 06:00:55 +08:00
  • 2329e1ddd0 Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748) Xu-Chen 2025-01-07 05:56:28 +08:00
  • 0f3eb1d294 Support cutlass Int8 gemm (#2752) Ke Bao 2025-01-06 22:51:22 +08:00
  • 06dd2eab84 Remove unused var in moe_align_kernel (#2751) Ke Bao 2025-01-06 22:13:28 +08:00
  • 439f65809f Fix sgl-kernel cu118 compile issue (#2750) Ke Bao 2025-01-06 21:59:31 +08:00
  • 2f0d386496 chore: bump v0.4.1.post4 (#2713) Yineng Zhang 2025-01-06 01:29:54 +08:00
  • 3900a94afe Support twoshot kernel (#2688) yizhang2077 2025-01-06 00:47:16 +08:00
  • ded9fcd09a improve moe_align_kernel for deepseek v3 (#2735) Xiaoyu Zhang 2025-01-06 00:28:22 +08:00
  • bc6ad367c2 fix lint (#2733) Yineng Zhang 2025-01-05 14:45:42 +08:00
  • 3a22a303d1 Revert the GLOO_SOCKET_IFNAME change (#2731) Lianmin Zheng 2025-01-04 20:13:16 -08:00
  • bdb3929dbb Refactor SchedulePolicy to improve code organization (#2571) libra 2025-01-04 00:05:16 +08:00
  • f5d0865b25 feat: Support VLM in reference_hf (#2726) Ce Gao 2025-01-03 22:32:30 +08:00
  • afdee7b1a9 [Docs] fix 404 - Contributor Guide, again (#2727) Ce Gao 2025-01-03 22:21:38 +08:00
  • cb34d848ac Update README.md (#2722) Lianmin Zheng 2025-01-03 00:32:20 -08:00
  • 0f9cc6d8d3 Fix package loss for small models (#2717) Lianmin Zheng 2025-01-02 18:25:26 -08:00
  • c7ae474a49 [Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601) yigex 2025-01-03 08:23:19 +08:00
  • bdf946bf81 Support loading pre-sharded moe weights (#2716) Lianmin Zheng 2025-01-02 15:07:37 -08:00
  • 8c8779cd05 [Fix] fix retract error in eagle speculative decoding (#2711) yukavio 2025-01-03 02:28:39 +08:00
  • 1775b963db [Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714) Mick 2025-01-03 02:28:22 +08:00
  • dd2e2d275f Docs: Update documentation workflow and contribution guide (#2704) Shi Shuai 2025-01-02 17:18:31 +00:00
  • a990daff9c Included multi-node DeepSeekv3 example (#2707) Rodrigo Garcia 2025-01-02 15:17:03 +01:00
  • ba5112ff69 feat: support moe_align_block_size_triton (#2712) Yineng Zhang 2025-01-02 21:47:44 +08:00
  • 815dce0554 Eagle speculative decoding part 4: Add EAGLE2 worker (#2150) yukavio 2025-01-02 19:22:34 +08:00
  • ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709) Lianmin Zheng 2025-01-02 02:09:08 -08:00
  • 9183c23eca Speed up update_weights_from_tensor (#2695) fzyzcjy 2025-01-02 18:05:19 +08:00
  • 148254d4db Improve moe reduce sum kernel performance (#2705) kk 2025-01-02 17:11:06 +08:00
  • a4d6d6f1dd [feat]: Add math eval to CI nightly run (#2663) Xiaotong Jiang 2025-01-01 15:29:35 -08:00
  • 062c48d2bd [Docs] Add Support for Pydantic Structured Output Format (#2697) Shi Shuai 2025-01-01 23:08:43 +00:00
  • b6e0cfb5e1 ROCm base image update (#2692) kk 2025-01-01 12:12:19 +08:00
  • 0d8d97b8e6 Doc: Rename contribution_guide.md (#2691) Chayenne 2024-12-31 14:35:48 -08:00