Commit Graph

  • 3dde86194a Conditionally import HiCacheHF3FS (#8598) pansicheng 2025-08-01 05:59:29 +08:00
  • b7170cc820 [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (#8630) Trevor Morris 2025-07-31 13:57:08 -07:00
  • 5c14515fec [bug] remove pdlb from minilb since it's no longer available (#8634) Simo Lin 2025-07-31 13:54:02 -07:00
  • 2cd2e27f80 SGLang HiCache NIXL Connector (#8488) Vishwanath Venkatesan 2025-07-31 15:09:42 -05:00
  • 743638bc03 misc: Remove debug print to logger.info (#8633) Chang Su 2025-07-31 12:56:52 -07:00
  • 061c8959ff Fix typos in py_test/test_launch_server.py (#6227) Michael Yao 2025-08-01 03:48:47 +08:00
  • 4acf690206 [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (#8577) Brayden Zhong 2025-07-31 14:31:21 -04:00
  • aee0ef52f5 [router] update router pypi version (#8628) Simo Lin 2025-07-31 11:24:12 -07:00
  • ae807774f5 [ci] fix genai-bench execution cmd (#8629) Simo Lin 2025-07-31 10:40:54 -07:00
  • 8fbcfd0723 Update step3v default config (#8626) Ke Bao 2025-08-01 00:49:26 +08:00
  • 3c307dc057 Fix hf3fs_fuse import error (#8623) Ke Bao 2025-07-31 22:42:31 +08:00
  • 5d15fb8c9d [bugfix] QWen-1M context support [2/3]: use the current cuda stream in the DCA kernel (#8611) Tao He 2025-07-31 22:41:39 +08:00
  • 016fd25127 [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (#8595) Shangming Cai 2025-07-31 21:29:34 +08:00
  • 023288645b chore: bump v0.4.10 (#8608) Yineng Zhang 2025-07-31 05:50:17 -07:00
  • 7a1f7fc504 [Feature] Hybrid EP and TP (#8590) Cheng Wan 2025-07-31 02:53:25 -07:00
  • 51c38163c1 model: support Step3V (#8583) Chang Su 2025-07-31 02:41:00 -07:00
  • 09f1a247ce fix: fork should not run pypi router (#8604) yihong 2025-07-31 17:37:13 +08:00
  • 32fa1e9cc2 [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (#8515) Cheng Wan 2025-07-31 02:34:02 -07:00
  • e7dc163f57 add SVG logo (#8603) Liangsheng Yin 2025-07-31 15:56:26 +08:00
  • e179e0b797 update sgl-kernel for EP: python part (#8550) Cheng Wan 2025-07-31 00:14:39 -07:00
  • d904959233 Support l3 cache (mooncake store) for hiradix cache (#7211) huangtingwei 2025-07-31 14:15:51 +08:00
  • 26c8a310bd fix incorrect increase of hit count (#8533) huangtingwei 2025-07-31 14:02:42 +08:00
  • 5963e50503 [bugfix] Fix 2 minor bugs in the hicache storage layer (#8404) yi wang 2025-07-31 13:47:14 +08:00
  • 43118f5f2a chore: bump sgl-kernel v0.2.8 (#8599) Yineng Zhang 2025-07-30 22:23:52 -07:00
  • a5f5ab4030 update sgl-kernel for EP: kernel part (#8514) Cheng Wan 2025-07-30 22:19:55 -07:00
  • 59aab76f0a Bug: Fix google gemma3n-mm audio input not working bug (#8365) Binyao Jiang 2025-07-30 21:23:09 -07:00
  • 659bfd1023 Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (#8544) Charles Chen 2025-07-30 20:28:07 -07:00
  • 67e53b16f5 Bump transformers to 4.54.1 to fix Gemma cache issue. (#8541) Lifu Huang 2025-07-30 19:50:54 -07:00
  • 9b9e82539b [Fix]Fix index oob in get_group_gemm_starts kernel. (#8564) Qi Yuhang 2025-07-31 10:49:35 +08:00
  • 66a398f49d [router] migrate router from actix to axum (#8479) Simo Lin 2025-07-30 17:47:19 -07:00
  • 299803343d Add hf3fs support for hicache storage (based on #7704) (#7280) pansicheng 2025-07-31 08:42:41 +08:00
  • a79a5d7012 Revert "Fix the input tools format and history tool_calls in OpenAI API (#6556)" (#8584) Chang Su 2025-07-30 13:12:05 -07:00
  • ec5f944271 [Model] Add support for Arcee Foundational Model (#8154) Adarsh Shirawalmath 2025-07-30 23:15:25 +05:30
  • 3bdcdd134b [Hot-Fix] moe_aligned_block_size CI failed in AMD (#8461) Yuan Luo 2025-07-31 00:28:32 +08:00
  • a730ce8162 [feature] [sgl-router] Add a dp-aware routing strategy (#6869) Rui Chen 2025-07-30 20:58:48 +08:00
  • 55ecdc0a8e Update CODEOWNERS (#8562) Shangming Cai 2025-07-30 16:05:57 +08:00
  • e3f08c77bc Update cutlass_moe.py (#8545) Elfie Guo 2025-07-29 23:46:34 -07:00
  • a9fd80336d [router] allow longer time out for router e2e (#8560) Simo Lin 2025-07-29 23:43:37 -07:00
  • 2fbb754e1d feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (#8516) hzh0425 2025-07-30 12:19:25 +08:00
  • a85ebf50b8 feat(hicache): support file backend reading directory config from env. (#8498) hzh0425 2025-07-30 12:18:46 +08:00
  • 9effeb5bdd Support EPLB in FusedMoE (#8448) Cheng Wan 2025-07-29 16:02:41 -07:00
  • 1992ef9ba7 fix: temporarily disable cuda-ipc for mm data tensor (#8431) Mick 2025-07-30 06:42:03 +08:00
  • c0fd77e839 bring back kimi vl ci (#8537) Stefan He 2025-07-29 13:14:18 -07:00
  • a4c3b121d8 Split the scheduler into multiple mixin classes to reduce the file size (#8483) Lianmin Zheng 2025-07-29 12:46:50 -07:00
  • 5973675bc3 Fix moe align kernel test (#8531) Ke Bao 2025-07-30 02:03:02 +08:00
  • 4d16c88b6e Update cutlass_moe.py (#8535) Elfie Guo 2025-07-29 10:49:41 -07:00
  • 7a4309cc8a [sgl-kernel performance] fix fp8 quant kernels' __nv_fp8_e4m3 dispatch bug to improve performance by 10%-20% (#8499) Xiaoyu Zhang 2025-07-29 23:31:54 +08:00
  • 813670660c Update README.md (#8528) Lianmin Zheng 2025-07-29 04:09:19 -07:00
  • 263c9236a0 Always trigger pr-test (#8527) Lianmin Zheng 2025-07-29 04:05:19 -07:00
  • 6478831be9 chore: bump v0.4.9.post6 (#8517) Yineng Zhang 2025-07-29 02:30:07 -07:00
  • 2e1d2d7e66 Add PVC and update resource limits in k8s config (#8489) TimWang 2025-07-29 14:15:31 +08:00
  • fb16fbaf52 Fix incorrect KV cache allocation for MTP models. (#8482) Lifu Huang 2025-07-28 22:54:50 -07:00
  • 0ce84c822b Support colocating requests (#7973) fzyzcjy 2025-07-29 13:51:49 +08:00
  • 59d0bf012f Tiny add warnings for DeepEP when it is suboptimal (#8426) fzyzcjy 2025-07-29 13:51:38 +08:00
  • 7df2c0c2db Reduce memory usage for fp4 moe (#8413) fzyzcjy 2025-07-29 13:51:23 +08:00
  • 69712e6f55 Rename the last step in pr-test.yml as pr-test-finish (#8486) Lianmin Zheng 2025-07-28 19:06:13 -07:00
  • 001bffca62 Update CODEOWNERS (#8485) Lianmin Zheng 2025-07-28 17:57:23 -07:00
  • 7c9697178e [CI]Add genai-bench Performance Validation for PD Router (#8477) Keyang Ru 2025-07-28 16:58:23 -07:00
  • 8240a6b013 chore: add glm 4.5 fp8 tp4 config (#8480) Yineng Zhang 2025-07-28 16:14:01 -07:00
  • 3a04aa4be7 chore: add glm4 fp8 tp8 config (#8478) Yineng Zhang 2025-07-28 16:08:53 -07:00
  • bd51694906 Update codeowner (#8476) Lianmin Zheng 2025-07-28 16:03:49 -07:00
  • 74e7e45710 Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (#8469) Stefan He 2025-07-28 14:36:08 -07:00
  • 1466c1b896 feat: support glm4 tuning (#8473) Yineng Zhang 2025-07-28 14:32:58 -07:00
  • 9c138a0445 [3/N] MoE Refactor: Simplify DeepEP Output (#8421) Cheng Wan 2025-07-28 11:37:17 -07:00
  • c8f549d96d Fix parsing ChatCompletionMessage (#7273) Timofey 2025-07-28 21:35:14 +03:00
  • 134fa43e19 [NVIDIA] Change to use num_local_experts (#8453) Kaixi Hou 2025-07-28 10:38:19 -07:00
  • ccfe52a057 fix: update dep (#8467) Yineng Zhang 2025-07-28 10:19:33 -07:00
  • 747dd45077 feat: throttle requests at scheduler based on --max_queued_requests (#7565) harrisonlimh 2025-07-28 07:32:33 -07:00
  • b582159246 Update PR template (#8465) Ke Bao 2025-07-28 22:12:36 +08:00
  • a9dd3ec3e9 fix: reorder topk experts to ensure shared expert replaces minimal score (#8125) erictanjn 2025-07-28 20:36:46 +08:00
  • 45bc170b36 chore: bump v0.4.9.post5 (#8458) Yineng Zhang 2025-07-28 02:11:06 -07:00
  • 2262369905 Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (#8457) Xiaoyu Zhang 2025-07-28 16:35:43 +08:00
  • fb4ce17de6 Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (#8449) strgrb 2025-07-28 16:32:46 +08:00
  • 25f73c6cf3 fix GLM4_MOE launch with compressed_tensor quant model (#8456) Minglei Zhu 2025-07-28 01:31:20 -07:00
  • 581e7dcb92 GLM-4.5 Model Support Follow-up (#8445) Binyao Jiang 2025-07-27 23:35:20 -07:00
  • 484d0e021d doc: add bench_one_batch_server in the benchmark doc (#8441) Qiaolin Yu 2025-07-27 23:07:54 -07:00
  • 5922c0cbf6 Remove zstd compression for building Dockerfile.gb200 (#8442) kyleliang-nv 2025-07-27 22:58:53 -07:00
  • 6d6a8bc278 GLM-4.5 Model Support (#8224) Yuxuan Zhang 2025-07-28 13:54:07 +08:00
  • 2fd5c7049f [PD] Fix abort_request for PD disaggregation (#8352) Shangming Cai 2025-07-28 12:48:27 +08:00
  • 4ad9737045 chore: bump transformer to 4.54.0 (#8416) Stefan He 2025-07-27 21:27:25 -07:00
  • 2810338401 [feat] Support different attention backends for prefill and decode (#6338) Qiaolin Yu 2025-07-27 20:42:29 -07:00
  • fe6a445d1e [router] improve router logs and request id header (#8415) Simo Lin 2025-07-27 19:30:19 -07:00
  • dd487e5553 bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (#8422) Chang Su 2025-07-27 19:01:02 -07:00
  • bb81daefb8 Fix docker buildx push error (#8425) kyleliang-nv 2025-07-27 17:59:38 -07:00
  • 58dd95fbc8 Fix test_openai_server (#8419) Chang Su 2025-07-27 13:36:01 -07:00
  • b47eda3316 bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (#8417) Chang Su 2025-07-27 13:31:06 -07:00
  • e983d66680 Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (#8316) Binyao Jiang 2025-07-27 13:12:59 -07:00
  • b58c3c285e Support ue8m0 for triton quant kernel (#7603) fzyzcjy 2025-07-28 04:04:35 +08:00
  • df90645525 Support overlapped lora updates (#8213) Lifu Huang 2025-07-27 13:00:44 -07:00
  • 95217a9b4d Change to use native arm runner (#8414) kyleliang-nv 2025-07-27 12:48:12 -07:00
  • 22e00eeb4a [Bugfix] Prevent PD server crash from invalid grammar (#8062) Shangming Cai 2025-07-28 00:17:51 +08:00
  • b3eac168e7 Support triton kernels v3.4.0 for fused_moe (#8258) Yuan Luo 2025-07-27 17:28:49 +08:00
  • 10ee89559e chore: upgrade flashinfer v0.2.9rc2 (#8406) Yineng Zhang 2025-07-27 01:41:22 -07:00
  • bf3352c559 chore: update CODEOWNERS (#8407) Yineng Zhang 2025-07-27 01:39:36 -07:00
  • 4d921f2b79 [hotfix] fix merge conflicts in FlashInferEPMoE (#8405) Cheng Wan 2025-07-27 01:24:10 -07:00
  • 44d600cd67 Support precomputed_embeddings for Llama 4 (#8156) Kevin Xiang Li 2025-07-27 01:14:49 -07:00
  • 5c9c275bc8 Use FlashInfer FP4 gemm. (#8241) Elfie Guo 2025-07-27 01:05:22 -07:00
  • bf0f448fe5 [2/N] MoE Refactor: Unify weight loader and quant methods (#8397) Cheng Wan 2025-07-27 01:00:21 -07:00
  • 36d6f0ba5b fix: fix the missing metrics on non-rank0 nodes (#7720) Yingchun Lai 2025-07-27 15:55:25 +08:00
  • 2a1936de96 Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#8351) Li Hui 2025-07-27 15:46:07 +08:00