Commit Graph

  • c951d312ed [PD] Fix large page size + chunk prefill (#5588) Byron Hsu 2025-04-20 17:21:54 -07:00
  • dcb8232596 Fix ChatCompletionMessageGenericParam to allow for None content (#5452) AmadeusW 2025-04-21 08:15:38 +08:00
  • 66c0ff9e31 fix: use fa3 for gemma2 (#5586) Yineng Zhang 2025-04-20 17:02:09 -07:00
  • 9a7e83e899 Fix enable chunked prefill for Llama4 (#5575) tarinkk 2025-04-20 20:01:30 -04:00
  • 417b44eba8 [Feat] upgrade pytorch2.6 (#5417) lukec 2025-04-21 07:06:34 +08:00
  • 475e2e378a [PD] Fix server crash when using batch requests (#5531) fzyzcjy 2025-04-21 07:02:23 +08:00
  • fba86b6b54 Tiny improve error message (#5526) fzyzcjy 2025-04-21 07:00:15 +08:00
  • 072b4d0398 Add document for LoRA serving (#5521) Baizhou Zhang 2025-04-20 14:37:57 -07:00
  • 9c43477710 Super tiny fix typo (#5559) fzyzcjy 2025-04-21 05:21:18 +08:00
  • fa2f677e18 Fix torch memory saver not enabled in DP scenario (#5560) fzyzcjy 2025-04-21 05:20:52 +08:00
  • 463d4b7400 Fix DeepEP cannot run on latest master (#5567) fzyzcjy 2025-04-21 05:19:42 +08:00
  • 9924bbe153 Fix bench_serving fail when zero warmup requests (#5574) fzyzcjy 2025-04-21 05:16:03 +08:00
  • fbdc94ba59 Release v0.4.5.post2 (#5582) Lianmin Zheng 2025-04-20 14:12:37 -07:00
  • b54b5a96e4 [Doc] Add instruction for profiling with bench_one_batch (#5581) Baizhou Zhang 2025-04-20 14:05:36 -07:00
  • bca832c7c6 [Fix] fix outlines and xgrammar (#4947) JieXin Liang 2025-04-21 04:31:25 +08:00
  • d9dd529854 enable DeepSeek V3 shared_experts_fusion in sm90 (#5571) Xiaoyu Zhang 2025-04-21 03:46:42 +08:00
  • 0a0dd34e6a Fix BumpAllocator error when no input_ids (#5564) fzyzcjy 2025-04-20 17:20:53 +08:00
  • 80ac527d22 [PD] Fix DeepSeek cannot be run on latest master (#5568) fzyzcjy 2025-04-20 17:19:48 +08:00
  • 99456bcacb [perf] introduce deep gemm group_gemm_masked as bmm (#5432) JieXin Liang 2025-04-20 15:38:27 +08:00
  • d07e797ace Fix bench_one_batch producing unnatural results for expert parallel (#5149) fzyzcjy 2025-04-20 15:38:04 +08:00
  • c555d794f7 Minor update for ROCm variable style (#5562) Zhaoyi Li 2025-04-20 01:45:27 -05:00
  • e2574ee986 fix hicache write back (#5543) Zhiqiang Xie 2025-04-19 21:56:22 -07:00
  • ab4b5606e4 [PD] Support page size > 1 (#5561) Byron Hsu 2025-04-19 21:54:27 -07:00
  • 20f1c8e374 Fix sampler nan check when calling top_k_top_p_sampling_from_probs (#5546) Yubo Wang 2025-04-19 21:47:23 -07:00
  • 613b197e57 Remove one kernel in per_tensor_quant_mla_fp8 (#5549) fzyzcjy 2025-04-20 06:08:15 +08:00
  • d58e354472 simplify the control logic for using shared experts fusion (#5504) Xiaoyu Zhang 2025-04-20 04:17:35 +08:00
  • bf86c5e990 restructure compressed_tensors_w8a8_fp8 (#5475) Xiaoyu Zhang 2025-04-19 19:52:15 +08:00
  • dca90f1db8 [PD] Remove the requirement of config file for mooncake backend (#5460) shangmingc 2025-04-19 19:31:00 +08:00
  • 0961feefca feat: use flashinfer jit package (#5547) Yineng Zhang 2025-04-19 00:28:39 -07:00
  • 59dd090f1c [PD] Fix no cache connect for receiver (#5534) ybyang 2025-04-19 14:55:28 +08:00
  • 569b032c58 [PD] Tiny fix timeout error when generating (#5545) fzyzcjy 2025-04-19 14:42:57 +08:00
  • f6a71139a8 Make profiler output file names consistent (#5548) fzyzcjy 2025-04-19 13:57:11 +08:00
  • 1e0806f30b Fix DeepGEMM masked cannot be run on groups that are not a multiple of 4 (#5340) fzyzcjy 2025-04-19 13:38:07 +08:00
  • 2c11f9c2eb chore: upgrade sgl-kernel 0.0.9.post2 (#5540) Yineng Zhang 2025-04-18 21:17:23 -07:00
  • a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) Yineng Zhang 2025-04-18 16:50:21 -07:00
  • 08b518d51f fix util import (#5542) Yineng Zhang 2025-04-18 15:06:46 -07:00
  • 4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) yhyang201 2025-04-19 00:51:29 +08:00
  • bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476) Wenxuan Tan 2025-04-18 03:13:57 -05:00
  • e465b08ddb fix bug of VLLM_AVAILABLE not defined (#5497) liwenju0 2025-04-18 15:59:03 +08:00
  • bed05878f6 fix kimi vl running bug after rebase main (#5461) Xiaoyu Zhang 2025-04-18 15:17:34 +08:00
  • b2a189dd11 use sglang_per_token_group_quant_fp8 from sgl-kernel instead of Triton kernel (#5473) strgrb 2025-04-18 15:05:24 +08:00
  • f28d82997a chore: bump sgl-kernel 0.0.9.post2 (#5518) Yineng Zhang 2025-04-17 23:42:39 -07:00
  • 8e09b37077 Sgl kernel fused_moe_gate support n_shared_experts (#5440) Xiaoyu Zhang 2025-04-18 14:05:15 +08:00
  • 53dcf38876 Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (#4836) fzyzcjy 2025-04-18 12:38:26 +08:00
  • 1effba4c70 Configuration qwen2_moe.py - qkv_bias now in transformers (#5512) Michael Feil 2025-04-17 21:23:22 -07:00
  • a0fc5bc144 [docs] Fix several consistency issues in sampling_params.md (#5373) Michael Yao 2025-04-18 10:54:40 +08:00
  • 27e9538a7e Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (#5426) mlmz 2025-04-18 10:51:39 +08:00
  • 211c7b31b8 Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511) u4lr451 2025-04-18 09:49:59 +08:00
  • c08a717c77 [Feat] Update sgl-kernel flashinfer to latest main version (#5500) PGFLMG 2025-04-18 03:43:23 +08:00
  • f13d65a7ea Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (#5503) mlmz 2025-04-18 02:37:43 +08:00
  • 06d0a3d92b [Bug fix] use correct func path in deepseek (#5496) Xuchun Shang 2025-04-17 17:41:41 +08:00
  • 22c2a79dc5 Fix a link in sgl-kernel/README.md (#5493) Michael Yao 2025-04-17 17:25:28 +08:00
  • 8beb356f0d Refactor DeepSeek decoder layer branches (#5205) fzyzcjy 2025-04-17 17:11:11 +08:00
  • c776234b45 Enable local attention during decode (#5479) Chang Su 2025-04-17 02:07:43 -07:00
  • 3bface15e6 Feat/support encoder model (like bert) (#4887) woodx 2025-04-17 16:50:48 +08:00
  • 6fb29ffd9e Deprecate enable-flashinfer-mla and enable-flashmla (#5480) Baizhou Zhang 2025-04-17 01:43:33 -07:00
  • 4fb05583ef Deprecate disable-mla (#5481) Baizhou Zhang 2025-04-17 01:43:14 -07:00
  • 81c891111f Add test for flash_attn_varlen_func kernel (#5484) Baizhou Zhang 2025-04-17 01:42:56 -07:00
  • 92d1561b70 Update attention_backend.md: plural form (#5489) Didier Durand 2025-04-17 10:42:40 +02:00
  • 8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982) eigen 2025-04-16 22:58:20 -04:00
  • 90faf9018e [verl] Modify the update_weights func to align with verl's resharding (#5345) BearBiscuit 2025-04-17 10:56:57 +08:00
  • 177320a582 Clean up imports (#5467) Lianmin Zheng 2025-04-16 15:26:49 -07:00
  • d7bc19a46a add multi-lora feature in README.md (#5463) Ying Sheng 2025-04-16 03:25:25 -07:00
  • 85ec0440a5 Update cutlass dependency. (#5447) Elfie Guo 2025-04-15 23:28:04 -07:00
  • 06a1656e02 [doc] Update benchmark_and_profiling.md (#5449) Xiaoyu Zhang 2025-04-16 14:27:34 +08:00
  • 6aca583420 Fix several minor issues in PD disaggregation (#5444) Cheng Wan 2025-04-15 23:04:41 -07:00
  • 5b5c7237c8 chore: bump v0.4.5.post1 (#5445) Yineng Zhang 2025-04-15 23:00:07 -07:00
  • a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) Baizhou Zhang 2025-04-15 22:01:22 -07:00
  • dd83e7e9c3 [Bug fix] need record start time in pd mode (#5425) ybyang 2025-04-16 10:11:16 +08:00
  • 0769b14bf9 [Minor] Move torch.compile patch to a better place (#5397) Lianmin Zheng 2025-04-15 18:37:07 -07:00
  • b64b88e738 [Docs] Update start/install.md (#5398) Michael Yao 2025-04-16 09:12:26 +08:00
  • bc24205b32 Support BNB quantization for llama/mllama (#5038) ryang 2025-04-16 09:00:31 +08:00
  • 3efc8e2d2a add attention backend supporting matrix in the doc (#5211) mRSun15 2025-04-15 17:16:34 -07:00
  • 27a009bb00 Fix ignore_eos parameter when loading a chat template (#5264) Chang Su 2025-04-15 17:09:45 -07:00
  • 8ec0bb7d55 chore: upgrade sgl-kernel 0.0.9.post1 (#5436) Yineng Zhang 2025-04-15 15:45:51 -07:00
  • fa909dc3c4 feat: update model_specific_adjustment (#5344) Yineng Zhang 2025-04-15 14:45:15 -07:00
  • e8f62b20ca Blackwell cutlass mla: Add check for bad page size/block num combinations (#5431) Trevor Morris 2025-04-15 14:07:42 -07:00
  • 88defc4d89 fix: solve release issue (#5434) Yineng Zhang 2025-04-15 12:58:11 -07:00
  • 6f509d5503 chore: bump sgl-kernel v0.0.9.post1 (#5430) Yineng Zhang 2025-04-15 11:00:21 -07:00
  • 12ef7e3bc3 bugfix: fix merge_state_v2 cuda graph (#5419) DefTruth 2025-04-16 01:18:47 +08:00
  • 838fa0f218 [minor] cleanup cmakelists.txt (#5420) Lianmin Zheng 2025-04-15 07:07:07 -07:00
  • f1b3b75fc6 [PD] Remove unused bootstrap param and fix port table type (#5423) shangmingc 2025-04-15 21:21:20 +08:00
  • 33b16ad178 Distinguish bootstrap key only in decode server (#5422) Liangsheng Yin 2025-04-15 20:59:28 +08:00
  • ffde65a094 [PD] Fix dynamic port support and MLA buffer for Mooncake (#5415) shangmingc 2025-04-15 19:29:31 +08:00
  • 471650dee0 Fix broadcast use cuda device lead to memory capacity unbalanced (#5416) lambert0312 2025-04-15 17:47:26 +08:00
  • d06a83fb01 Support dynamic connection and TP 16 (#5351) Yuan Luo 2025-04-15 17:08:07 +08:00
  • 5d13440162 [FIX] Fix concatenation error in capture_bs when --disable-cuda-graph-padding is enabled and without MTP (#5412) Zhaoyang Hao 2025-04-15 16:42:27 +08:00
  • f88f7e1943 [misc] fix ci flaky case (#5352) JieXin Liang 2025-04-15 16:37:16 +08:00
  • 3dfc6023ce Fix bench_serving with random-ids (#5214) Yuhong Guo 2025-04-15 16:34:35 +08:00
  • 15e91d721b Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (#5406) fzyzcjy 2025-04-15 16:33:47 +08:00
  • 8aab7fdb21 chore: upgrade sgl-kernel 0.0.9 (#5401) Yineng Zhang 2025-04-14 22:37:59 -07:00
  • e940dc4f06 chore: bump sgl-kernel 0.0.9 (#5400) Yineng Zhang 2025-04-14 21:34:04 -07:00
  • 388e15c0db kernel: support slightly faster merge_state_v2 cuda kernel (#5381) DefTruth 2025-04-15 12:28:23 +08:00
  • 11421a3f44 fix: update pr-test-sgl-kernel (#5399) Yineng Zhang 2025-04-14 21:14:59 -07:00
  • 6c41fcf0e4 chore: upgrade DeepGEMM (#5395) Yineng Zhang 2025-04-14 20:32:46 -07:00
  • ee9d6ca677 [fix/misc] remove duplicate row in deepseek v2 model (#5279) Yangcheng Li 2025-04-15 09:41:24 +08:00
  • 2dd6489468 Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5291) Ximingwang-09 2025-04-15 09:40:31 +08:00
  • 61e7c4dd21 Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5368) lambert0312 2025-04-15 09:39:44 +08:00
  • dae7944440 minor clean up of sgl-kernel/CMakeLists.txt (#5393) Lianmin Zheng 2025-04-14 18:38:44 -07:00
  • f6772f1497 [Fix] Turn off DeepGEMM by default (#5263) Baizhou Zhang 2025-04-14 17:45:44 -07:00