Commit Graph

  • 442534aa44 Add CI for gpt-oss model on hopper (#8851) fzyzcjy 2025-08-09 15:34:23 +08:00
  • de8b8b6e5c chore(deps): update minimum python to 3.10 (#8984) ishandhanani 2025-08-09 00:30:23 -07:00
  • 3f2e315f6e optimize: reduce shuffle and quantization overhead in cutlass_moe sm90 (#8962) tql.99 2025-08-09 15:29:12 +08:00
  • 6e2151183b Fix incorrect default get_hidden_dim logic (#8987) Lifu Huang 2025-08-09 00:25:38 -07:00
  • a47baff12c [hotfix] use the original implementation in 8785 (#8994) Cheng Wan 2025-08-08 21:47:25 -07:00
  • fd7e15b76d Revert "[bug fix] Ensure local token and global token buffers are pointing to different storage " (#8993) Cheng Wan 2025-08-08 21:34:17 -07:00
  • fc42ff7b63 [Fix] Fix wrong backend chosen in hybrid backend (#8989) DarkSharpness 2025-08-08 21:21:17 -07:00
  • 7c0db868a1 Molly/ci gnr server (#8667) DiweiSun 2025-08-09 11:01:16 +08:00
  • 706bd69cc5 Clean up server_args.py to have a dedicated function for model specific adjustments (#8983) Lianmin Zheng 2025-08-08 19:56:50 -07:00
  • 23f2afb2ce [AMD] Update SGLang image fallback logic for AMD CI (#8980) michael-amd 2025-08-08 18:51:29 -07:00
  • a60f88b5a4 Add unit test for flashinfer fp4 moe (#8330) Trevor Morris 2025-08-08 17:55:37 -07:00
  • 591c232f7c [1/2][resubmit] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (select_experts) (#8770) Trevor Morris 2025-08-08 17:55:06 -07:00
  • f352b793be Minor Optimizations in Schedule Batch (#8724) Lianmin Zheng 2025-08-08 16:10:16 -07:00
  • 6642e3a295 [Fix] Add a workflow to cancel all pending CI runs (#8988) Lianmin Zheng 2025-08-08 16:09:50 -07:00
  • 67a7d1f699 Create cancel-all-pr-test-runs (#8986) Lianmin Zheng 2025-08-08 15:53:51 -07:00
  • 92cbef59ec [bug fix] Ensure local token and global token buffers are pointing to different storage (#8785) Elfie Guo 2025-08-08 15:13:32 -07:00
  • b3359dc9bf Update qwen3_coder_detector.py for streaming (#8371) maocheng23 2025-08-08 14:51:03 -07:00
  • 7b7e56150e [router] fix radix tree integration issues in PD router (#8982) Simo Lin 2025-08-08 14:47:51 -07:00
  • 1a8706c8b9 [router] reduce contention, fix double-count race (#8978) Simo Lin 2025-08-08 14:29:14 -07:00
  • 7d3af603e7 chore(ci): update Python version from 3.9 to 3.10 in sgl-kernel workflow (#8981) ishandhanani 2025-08-08 14:03:17 -07:00
  • 4e7f025219 chore(gb200): update to CUDA 12.9 and improve build process (#8772) ishandhanani 2025-08-08 13:42:47 -07:00
  • 36bfddecb9 [router] add metrics for worker and policy (#8971) Tony Lu 2025-08-09 04:41:40 +08:00
  • 91e2f902db Fix kimi k2 function call format (#8968) Lianmin Zheng 2025-08-08 13:25:14 -07:00
  • a59cbea92d [router] harden retries + metrics; fix streaming load; header filtering (#8972) Simo Lin 2025-08-08 13:10:14 -07:00
  • 53f7874ae6 refine aiter_backend for mtp (#7279) valarLip 2025-08-09 02:06:02 +08:00
  • 61a4680494 [router] router circuit breaker core (#8941) Simo Lin 2025-08-08 09:20:22 -07:00
  • 9020f7fc32 chore: bump v0.5.0rc0 (#8959) Yineng Zhang 2025-08-08 09:16:18 -07:00
  • dd650e0e21 [RL] fix skip_server_warmup and rl health_generate logic (#8757) Zilin Zhu 2025-08-08 19:34:38 +08:00
  • a947154286 Revert "Support Multi Process Tokenizer Manager" (#8960) Lianmin Zheng 2025-08-08 02:28:27 -07:00
  • 41357e511b chore: update flashinfer (#8958) Yineng Zhang 2025-08-08 02:15:22 -07:00
  • e2fd2b9c7e Simple prefetch policy (#8692) pansicheng 2025-08-08 17:09:28 +08:00
  • 7490e3f67d Support Multi Process Tokenizer Manager (#6555) ybyang 2025-08-08 16:45:50 +08:00
  • 6ee6619b7a add zai-org/GLM-4.5-Air-FP8 model into nightly CI (#8894) Minglei Zhu 2025-08-08 01:44:19 -07:00
  • 54ea57f245 chore: bump sgl-kernel v0.3.3 (#8957) Yineng Zhang 2025-08-08 01:35:37 -07:00
  • b4c9f38a76 [NVIDIA] Fix missing get_col_major_tma_aligned_tensor for Blackwell deepgemm in EpMoE (#8955) Kaixi Hou 2025-08-08 01:12:33 -07:00
  • 1132547496 Add ernie4.py for ERNIE-4.5 (#7657) Wenbo Yang 2025-08-08 15:55:48 +08:00
  • 1d24db8348 Expert Parallelism for GPT-OSS (#8944) Cheng Wan 2025-08-08 00:46:42 -07:00
  • 444013585d Fix typos and unify size(s)/stride(s) API calls (#8799) triple-mu 2025-08-08 15:18:08 +08:00
  • 9c7e392465 bench: add attention sink op benchmark, triton and trtllm-gen [B200] (#8932) eigen 2025-08-08 03:16:23 -04:00
  • 08fab2b0c4 minor: global workspace buffer for trtllm-gen mha from flashinfer (#8952) eigen 2025-08-08 03:12:12 -04:00
  • 0d1e27a0c5 Better optimization log for gpt-oss model (#8953) Xiaoyu Zhang 2025-08-08 15:11:48 +08:00
  • 774b47f3f1 Reduce scheduler recv requests overhead (#8947) fzyzcjy 2025-08-08 15:10:05 +08:00
  • 76915d68a8 Fix enable flashinfer mxfp4 moe bf16 check (#8950) Xiaoyu Zhang 2025-08-08 13:52:09 +08:00
  • 39fd178831 refactor: Move scalar_types.py to sgl-kernel to avoid circular import (#8720) Hongbo Xu 2025-08-08 10:22:16 +08:00
  • ed0a3dd54a Enhancements for bench_one_batch (#8703) Zaili Wang 2025-08-08 10:00:31 +08:00
  • 2e901e892f [router] dedicated prefill HTTP client and request-path optimizations (#8923) Simo Lin 2025-08-07 17:31:45 -07:00
  • d3be97104b correct the tp_plan logic (#8850) Stefan He 2025-08-07 16:53:34 -07:00
  • 3e7ff1ab1f fix: reasoning parser when request have enable_thinking flag (#8933) Xinyuan Tong 2025-08-07 15:52:06 -07:00
  • aaf0ad8cdf remove vllm fp8quant from fp8.py (#8937) Stefan He 2025-08-07 15:50:52 -07:00
  • 361379b52b docs: update README (#8929) Yineng Zhang 2025-08-07 14:28:35 -07:00
  • 1ac16add8b chore: support blackwell cu129 image (#8928) Yineng Zhang 2025-08-07 14:24:57 -07:00
  • c3a5fb3b28 codeowner updates for modelopt related files (#8925) Zhiyu 2025-08-07 14:21:41 -07:00
  • 4bf6e5a6b0 fix: use openai 1.99.1 (#8927) Yineng Zhang 2025-08-07 14:20:35 -07:00
  • 3ae33fcd0a Fix hopper launch gpt-oss model illegal memory (#8908) Xiaoyu Zhang 2025-08-08 01:02:40 +08:00
  • 500b15c960 [router] upgrade router version to 0.1.9 (#8844) Simo Lin 2025-08-07 09:29:12 -07:00
  • 16a4c66d25 [router] update pd router ci summary step with new threshold (#8916) Simo Lin 2025-08-07 07:15:38 -07:00
  • 89e6521c61 [router] re-enable pd router benchmark CI (#8912) Simo Lin 2025-08-07 06:29:36 -07:00
  • fd05b56750 refactor(sgl-router): Replace once_cell with LazyLock in worker.rs and remove once_cell dependency from Cargo.toml (#8698) Tien Nguyen 2025-08-07 20:14:03 +07:00
  • 482c3db29f Fix sgl-kernel arch and missing package in CI (#8869) fzyzcjy 2025-08-07 17:08:15 +08:00
  • 47824c1488 [Perf] Auto enable best flashinfer mxfp4 kernel in b200 (#8898) Xiaoyu Zhang 2025-08-07 16:08:41 +08:00
  • c36a6693f3 Disable gemma3 for SWA due to CUDA illegal memory access error (#8895) Xinyuan Tong 2025-08-07 00:44:44 -07:00
  • 62f8eb48b1 [CPU] Fix fallback allgather issue (#8041) blzheng 2025-08-07 15:08:18 +08:00
  • b7cd743038 [Feat] QWen-1M context support[2/2]: Update block sparse attention backend (#5949) PGFLMG 2025-08-07 14:49:36 +08:00
  • a69b637014 [router] fix req handling order, improve serialization, remove retry (#8888) Simo Lin 2025-08-06 23:24:39 -07:00
  • 2d120f8b18 [Feature][Multimodal] Implement LRU cache for multimodal embeddings (#8292) Zheng Wengang 2025-08-07 14:21:40 +08:00
  • 4f2e1490c3 [AMD] Pull latest SGLang version for AMD CI (#8787) michael-amd 2025-08-06 20:20:26 -07:00
  • 3fa3c6cd6a Enables force reasoning based on chat template for Qwen3-Thinking (#8369) Xinyuan Tong 2025-08-06 20:02:47 -07:00
  • 6210e2c4f0 Support GPU pinning for LoRA (#8697) Lifu Huang 2025-08-06 19:39:45 -07:00
  • 6ad6c8c9e6 feat: openai oss attention sink support with trtllm-gen backend #8825 (#8834) eigen 2025-08-06 22:18:27 -04:00
  • 5b6acc1495 fix glm4 moe (#8883) Cheng Wan 2025-08-06 18:02:31 -07:00
  • 4373df5525 add flashinfer mxfp4 (#8847) Xiaoyu Zhang 2025-08-07 07:23:41 +08:00
  • c0e84297c2 Use reduce scatter for DP (#8539) Trevor Morris 2025-08-06 16:21:26 -07:00
  • 92cc32d9fc Support v1/responses and use harmony in serving_chat (#8837) Chang Su 2025-08-06 16:20:34 -07:00
  • cbbd685a46 chore: use torch 2.8 stable (#8880) Yineng Zhang 2025-08-06 15:51:40 -07:00
  • 78aad91037 [CI] fix pip upgrade (#8881) Cheng Wan 2025-08-06 15:02:32 -07:00
  • 288ae41f7a [NVIDIA] Fix num_experts in modelopt_quant (#8811) Shu Wang 2025-08-06 16:35:07 -05:00
  • 01c99a9959 chore: update Dockerfile (#8872) Mick 2025-08-07 00:30:33 +08:00
  • b114a8105b Support B200 in CI (#8861) fzyzcjy 2025-08-06 21:42:44 +08:00
  • 0475448ee3 Optimize triton swa kernel by skipping computation (#8860) Ke Bao 2025-08-06 21:37:50 +08:00
  • 399e7ec8b3 Refine naming (#8868) Ke Bao 2025-08-06 21:37:02 +08:00
  • 1bd5316873 fix benchmark fp8 blockwise group gemm (#8815) Yuan Luo 2025-08-06 21:02:21 +08:00
  • aeac900ca2 fix: resolve ci issue (#8859) Yineng Zhang 2025-08-06 02:28:14 -07:00
  • 4fc5f2f977 Add unit test for triton swa kernel (#8853) Ke Bao 2025-08-06 16:10:38 +08:00
  • 168033d5fb Support mxfp4 for GPT-OSS (#8843) Ying Sheng 2025-08-06 00:05:25 -07:00
  • cbbb738371 [2/3] Optimize Slime Update Weights: Avoid GPU-to-CPU Device Sync when update expert weights (#8753) Stefan He 2025-08-05 22:09:52 -07:00
  • 89588179cf [1/3] Optimize Slime Update Weights: Remove QWen3MOE Load Weight Overhead (#8751) Stefan He 2025-08-05 22:07:54 -07:00
  • 8c7bb39dfb [router] PD Router Simplification and Reorganization (#8838) Simo Lin 2025-08-05 21:20:38 -07:00
  • ca47e24f5d [Feature] improve TBO: two chunk overlap (#8144) HouseWest 2025-08-06 12:11:01 +08:00
  • d26ca84f39 Support bailing moe (#8680) Praneth Paruchuri 2025-08-06 09:10:34 +05:30
  • 8128e08d36 Turn off hybrid cache by default (#8839) Ke Bao 2025-08-06 09:53:45 +08:00
  • 5d62b56f7e [router] complete router oai spec (#8828) Simo Lin 2025-08-05 18:30:19 -07:00
  • 3ae8e3ea8f chore: upgrade torch 2.8.0 (#8836) Yineng Zhang 2025-08-05 17:32:01 -07:00
  • c1d2061f97 Add initial support for gpt-oss (#8824) Ying Sheng 2025-08-05 13:42:01 -07:00
  • 556e4143f0 fix: remove unused import (#8809) Yineng Zhang 2025-08-05 13:40:22 -07:00
  • 4ef47839ae feat: use py312 (#8832) Yineng Zhang 2025-08-05 13:38:22 -07:00
  • 32d9e39a29 Fix potential memory fault issue and ncclSystemError in CI test (#8681) kk 2025-08-06 03:19:37 +08:00
  • 4f4e0e4162 chore: upgrade flashinfer 0.2.10 (#8827) Yineng Zhang 2025-08-05 12:04:01 -07:00
  • 901ab758ec chore: upgrade transformers 4.55.0 (#8823) Yineng Zhang 2025-08-05 11:37:21 -07:00
  • 8e8545caf6 fix: update cmake (#8817) Yineng Zhang 2025-08-05 09:38:30 -07:00
  • a4b0d5c9e5 GLM-4.5 and GLM-4.5-Air both support (#8804) Yuxuan Zhang 2025-08-05 18:29:20 +08:00