Commit Graph

  • 40e3b2beeb feat: add trtllm-gen mha from direct call (#8782) eigen 2025-08-05 06:28:39 -04:00
  • 75df31b60e chore: bump sgl-kernel v0.3.2 (#8802) Yineng Zhang 2025-08-05 02:35:20 -07:00
  • 194561f27a feat: support sgl-kernel cu129 (#8800) Yineng Zhang 2025-08-05 02:33:47 -07:00
  • 5e91fed1c5 Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" (#8797) Yineng Zhang 2025-08-04 23:30:43 -07:00
  • 873f384a51 [feat] Add detail in image_data (#8596) Yuhao Yao 2025-08-05 14:01:38 +08:00
  • b01eeb80f8 [NVIDIA]Fix local_num_experts for EP (#8779) Shu Wang 2025-08-05 00:01:14 -05:00
  • 1ea94d3b92 chore: upgrade flashinfer v0.2.9 (#8780) Yineng Zhang 2025-08-04 21:59:18 -07:00
  • 354ac43555 [pd-router] Add Configurable Retry Logic for reduce backend pressure (#8744) Simo Lin 2025-08-04 20:42:07 -07:00
  • d98a4913ea [PD] Refactor parallel sizes and add pp support for mooncake (#8571) Shangming Cai 2025-08-05 11:18:11 +08:00
  • 08f8f49016 [CPU][sgl-kernel] biased_grouped_topk: fix correction_bias dtype to float32 (#8212) Chunyuan WU 2025-08-05 09:28:31 +08:00
  • d4bf5a8524 Support OCP MXFP4 quantization on AMD GPUs (#8255) kk 2025-08-05 09:14:52 +08:00
  • 7cb20754fa [Fix] Fix several issues preventing gemma3n LoRA support. (#8776) Lifu Huang 2025-08-04 17:11:46 -07:00
  • 6d0646da11 [NVIDIA] Fix breakage of using trtllm-gen fp8 moe (#8773) Kaixi Hou 2025-08-04 16:30:13 -07:00
  • 02bc1c7d80 chore: bump sgl-kernel v0.3.1 (#8771) Yineng Zhang 2025-08-04 13:18:54 -07:00
  • fc8c8e5041 Integrate triton_kernels in sgl-kernel (#8762) Qiaolin Yu 2025-08-04 12:12:14 -07:00
  • 9bd4872a34 [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' (#8768) Trevor Morris 2025-08-04 11:08:08 -07:00
  • 2fa0462c39 [router] introduce dp worker abstraction (#8639) Simo Lin 2025-08-04 06:42:20 -07:00
  • 915140fd18 [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552) azhurkevich 2025-08-04 03:10:02 -07:00
  • 36fc9260a2 [bugfix] fix import path in HiCacheController (#8749) Baron Liu 2025-08-04 13:19:15 +08:00
  • fee0ab0fba [CI] Ascend NPU CI enhancement (#8294) Even Zhou 2025-08-04 13:16:38 +08:00
  • f57d2dc162 [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (#8738) Xiaoyu Zhang 2025-08-04 12:55:57 +08:00
  • f2d68ded6d Rename lora_path to lora_id in batches (#8437) Baizhou Zhang 2025-08-03 21:08:28 -07:00
  • 3b87a9e8ae Fix bug of refactoring TopKOutput in w4afp8 (#8745) Yuan Luo 2025-08-04 11:05:02 +08:00
  • f024795e57 Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (#8733) YyWangCS 2025-08-04 10:02:51 +08:00
  • b102353f8f [MoE] Enable renormalize=False in Triton kernels (#8735) Cheng Wan 2025-08-03 17:03:04 -07:00
  • 7a27e798ca [CI] Do not trigger pd-disaggregation CI in draft PR (#8737) Liangsheng Yin 2025-08-04 05:12:20 +08:00
  • 76ba5bbe12 fix args typo in memory_pool_host (#8662) huangtingwei 2025-08-04 04:47:29 +08:00
  • ed6f7597b3 Fix the missing 'lof' choice of --schedule-policy server args (#7114) Yingchun Lai 2025-08-04 03:29:42 +08:00
  • e67276ecb3 feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (#8678) tql.99 2025-08-04 01:47:15 +08:00
  • 0242bb9c74 Fix triton kernels topk with keyword arguments (#8732) Ke Bao 2025-08-04 01:45:15 +08:00
  • 760286e3d3 use fp32 for e_score_correction_bias in GLM-4.5 (#8729) Yuxuan Zhang 2025-08-04 01:43:40 +08:00
  • 3435a24e81 [RL] fix update weight for FusedMoE with EP (#8676) Zilin Zhu 2025-08-04 01:20:39 +08:00
  • 00da906584 feat: Support DP Attention for step3_vl (#8699) yhyang201 2025-08-03 19:35:26 +08:00
  • 8cd344586e chore: bump v0.4.10.post2 (#8727) Yineng Zhang 2025-08-03 03:43:29 -07:00
  • 0e0eef00ce [DP] fix the compatibility issue between DP attention and --attention-backend triton (#8723) Cheng Wan 2025-08-03 03:06:57 -07:00
  • cb099d2095 [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (#8579) Cheng Wan 2025-08-03 03:06:47 -07:00
  • 7a91330149 Save cuda graph memory for fa3 (#8567) Cheng Wan 2025-08-03 03:06:31 -07:00
  • 5ce5093b97 chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (#8718) Yineng Zhang 2025-08-03 02:31:50 -07:00
  • 6f9baf1002 [Improvements] Merge health check route (#8444) ybyang 2025-08-03 16:59:06 +08:00
  • a31b7a7024 feat: Add new moe triton for NVIDIA RTX 6000 Ada (#8547) Jasper James 2025-08-03 15:57:35 +08:00
  • 7ed8e51bc3 [fix] Fix divide by zero error for llama4. (#8683) Varun Vinayak Shenoy 2025-08-03 00:55:55 -07:00
  • 32f2815451 Do layernorm before allgather for DP attention (#8631) Trevor Morris 2025-08-03 00:53:08 -07:00
  • f7b2853ff8 [feat] support minimum token load balance in dp attention (#7379) Guanhua Wang 2025-08-03 15:46:47 +08:00
  • b0add2da00 HiCache storage, style change and bug fix (#8719) Zhiqiang Xie 2025-08-03 00:05:04 -07:00
  • 0305c5053f Reduce memory accumulation in long-running server (#8306) Wenxuan Tan 2025-08-03 02:03:16 -05:00
  • 8675bdf246 Support limiting max loaded loras in CPU. (#8650) Lifu Huang 2025-08-03 00:02:23 -07:00
  • a437aa9987 [hotfix] fix mixtral with tensor-level compressed-tensor quantization (#8721) Cheng Wan 2025-08-02 22:59:25 -07:00
  • 0e612dbf12 Tiny fix CI pytest error (#8524) fzyzcjy 2025-08-03 13:48:42 +08:00
  • 9f47d686e5 Fix fused MoE when routed_scaling_factor is None (#8709) Liangsheng Yin 2025-08-03 12:42:01 +08:00
  • d9def43dcd [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (#8722) Qi Yuhang 2025-08-03 12:13:47 +08:00
  • e273aa6dcf [Feature] Radix Tree in C++ (#7369) DarkSharpness 2025-08-02 19:50:14 -07:00
  • 828a4fe944 [router] Implement HTTP Dependency Injection Pattern for Router System (#8714) Simo Lin 2025-08-02 19:16:47 -07:00
  • 8ada1ab6c7 Fix triton moe error caused by TopK refactor (#8705) fzyzcjy 2025-08-03 09:49:47 +08:00
  • e314b084c5 [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (#8693) Lianmin Zheng 2025-08-02 18:43:14 -07:00
  • 403566bcca Remove assertions about per group quant fp8 (#8717) fzyzcjy 2025-08-03 08:08:40 +08:00
  • 0a56b721d5 chore: bump sgl-kernel v0.2.9 (#8713) Yineng Zhang 2025-08-02 16:21:56 -07:00
  • 603f5ce020 [Bug] fix green context's incompatibility with cuda < 12.4 (#8701) Liangsheng Yin 2025-08-03 06:23:11 +08:00
  • 6d4fd8826e [router] minor code clean up and refactoring (#8711) Simo Lin 2025-08-02 13:46:31 -07:00
  • f9f0138f80 Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (#8706) Liangsheng Yin 2025-08-02 20:14:30 +08:00
  • ac6962ccd6 [Doc] Polish sgl-kernel readme for cu126 build error (#8704) PGFLMG 2025-08-02 17:03:07 +08:00
  • 4ca43b061c Add tensor.detach() back to update weight util (#8691) Stefan He 2025-08-02 00:41:05 -07:00
  • ea93079b30 model: adapt mllama4 to VisionAttention (#8512) Wenchen Lo 2025-08-02 00:39:40 -07:00
  • 4bec99ecd0 Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (#7434) Yusong Gao 2025-08-02 14:43:45 +08:00
  • 89caf7a3c6 [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (#8688) Trevor Morris 2025-08-01 19:00:24 -07:00
  • b27b11919c chore(gb200): update dockerfile to handle fp4 disaggregation (#8694) ishandhanani 2025-08-01 18:58:00 -07:00
  • f642524fd9 [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (#8364) Trevor Morris 2025-08-01 18:14:24 -07:00
  • 82e6c3a65a Add support for NCCL symmetric memory for TP allreduces (#8238) Nicolas Castet 2025-08-01 18:30:55 -05:00
  • b89d37cb11 [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (#8618) Baron Liu 2025-08-02 07:02:53 +08:00
  • 5deab1283a upgrade xgrammar 0.1.22 (#8522) Swipe4057 2025-08-02 01:59:15 +03:00
  • d1c4d51c08 bugfix(hicache): Fix 'MooncakeStore' not defined error. (#8668) hzh0425 2025-08-02 06:58:17 +08:00
  • 1fe691a429 Fix FP8 block quantization when N or K is not multiples of 128 (#8648) YanbingJiang 2025-08-02 06:57:19 +08:00
  • e252192679 Fix deepgemm masked grouped gemm jit compile (#8679) Ke Bao 2025-08-02 06:37:59 +08:00
  • 07e46ecaad Update CODEOWNERS (#8686) Lianmin Zheng 2025-08-01 15:09:44 -07:00
  • ab9b893e61 [bug] limit bootstrap room to [0, 2^63 - 1] (#8684) Simo Lin 2025-08-01 14:41:01 -07:00
  • 6a7528e623 [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (#8685) Trevor Morris 2025-08-01 14:28:04 -07:00
  • 2ae95d17e8 Disable tp for shared experts under expert parallelism for GLM4.5 model (#8647) Minglei Zhu 2025-08-01 12:02:35 -07:00
  • 2d401bd99d [fix] fix pd disagg error of vlms (#8094) 萝卜菜 2025-08-02 02:16:29 +08:00
  • b17c5b0118 fix arg typo for --disaggregation-transfer-backend (#8664) Zac 2025-08-02 01:00:47 +08:00
  • db7343c992 fix per token cuda kernel hidden dim cannot divide by 16 (#8543) Stefan He 2025-08-01 09:27:18 -07:00
  • 533cb5b274 [DOC]Update sgl-kernel README (#8665) Hongbo Xu 2025-08-01 22:59:27 +08:00
  • 6bdd27861b [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013) Peter Pan 2025-08-01 22:01:24 +08:00
  • 46e9d1c7c1 Increase tolerance to address CI failures (#8643) Lifu Huang 2025-08-01 02:32:10 -07:00
  • 6c88f6c8d9 [5/N] MoE Refactor: Update MoE parallelism arguments (#8658) Cheng Wan 2025-08-01 01:20:03 -07:00
  • c8d3a402c1 Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (#8511) Binyao Jiang 2025-08-01 00:07:41 -07:00
  • 7e831efee8 Fix chat template handling for OpenAI serving (#8635) Xinyuan Tong 2025-07-31 21:49:45 -07:00
  • 20b5563eda Add hf3fs_utils.cpp to package-data (#8653) pansicheng 2025-08-01 12:41:09 +08:00
  • 33f0de337d chore: bump v0.4.10.post1 (#8652) Ke Bao 2025-08-01 12:07:30 +08:00
  • e7e5a3050a Update batch size limitation of dsv3_router_gemm kernel to 16 (#8051) Baizhou Zhang 2025-07-31 20:53:31 -07:00
  • dd7ca00601 Interface change for kvcache io to support page first layout (#8318) Zhiqiang Xie 2025-07-31 20:37:49 -07:00
  • 9305ea6c2d HiCache, fixing hash value indexing (#8636) Zhiqiang Xie 2025-07-31 20:29:51 -07:00
  • aa4c66b564 [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (#8450) Kaixi Hou 2025-07-31 19:56:34 -07:00
  • 39decec10b [router] upgrade router version to 0.1.8 (#8645) Simo Lin 2025-07-31 19:00:23 -07:00
  • f6f46f4629 [router] add basic usage doc (#8640) Simo Lin 2025-07-31 18:11:48 -07:00
  • 2886e23dbd [bugfix] fix router python parser for pd urls (#8644) Simo Lin 2025-07-31 18:09:31 -07:00
  • 99795d61e6 [Bugfix] fix w8a8_int8 load issue (#8308) Even Zhou 2025-08-01 08:30:16 +08:00
  • fe5086fd8b chore: speedup NPU CI by cache (#8270) li chaoran 2025-08-01 08:29:50 +08:00
  • 04913430c6 Feature/modelscope model download (#8083) yrk111222 2025-08-01 08:29:31 +08:00
  • 0ad098b494 Revert "Fix nan value generated after custom all reduce (#8532)" (#8642) Yineng Zhang 2025-07-31 17:26:49 -07:00
  • 4a6e7a66a0 Fix nan value generated after custom all reduce (#8532) kk 2025-08-01 07:15:43 +08:00
  • 4b04998d38 TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632) Faraz 2025-07-31 19:03:40 -04:00