Commit Graph

  • 85ed8e0a5e Optimize nvfp4 block scaled gemm kernel when M is small. (#10101) Qi Yuhang 2025-09-07 13:31:00 +08:00
  • dd1e268938 CUTLASS fp8 blockwise gemm support of sm120 (#9969) Jianying 2025-09-07 13:28:54 +08:00
  • 9a7ced4e4d [Feature] LMCache Connector Integration (#9741) Yuwei An 2025-09-06 20:14:55 -07:00
  • cb3918a091 Optimize moe_sum_reduce_kernel (#9477) Yuan Luo 2025-09-07 09:16:18 +08:00
  • f3b6760213 [Auto Sync] Update server_args.py (20250906) (#10117) Lianmin Zheng 2025-09-06 16:59:36 -07:00
  • 9eb50ecc9c [router] Improve the router e2e tests (#10102) Keyang Ru 2025-09-06 16:19:28 -07:00
  • b3e7a2cee4 increase the rust e2e timeout (#10116) Keyang Ru 2025-09-06 16:17:34 -07:00
  • 00974e4f6e [CI] Refactor disaggregation tests (#10068) Shangming Cai 2025-09-06 22:14:46 +08:00
  • 5f1eb20484 [chore] Remove unused ep_moe cuda kernels (#9956) hlu1 2025-09-06 01:35:50 -07:00
  • 039cef76aa Remove non-accelerated targets(100 and up) from cmake (#10041) hlu1 2025-09-06 01:35:28 -07:00
  • 4c22ebe2e8 Disable kernel cutlass_mla_decode on SM103 (#10058) hlu1 2025-09-06 01:35:18 -07:00
  • a5a03209e9 Fix circular import (#10107) Cheng Wan 2025-09-06 01:34:17 -07:00
  • 21af5c0404 [Fix] Compatibility between DP attention and pipeline parallelism (#10100) Cheng Wan 2025-09-06 01:34:10 -07:00
  • 012584ecd5 perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell (#9834) Jinyang Yuan 2025-09-06 14:06:46 +08:00
  • 90dfe3de4c [NVIDIA] disable chunked prefix cache when dp and blackwell is used (#9861) Kaixi Hou 2025-09-05 23:05:16 -07:00
  • 9a719b7afc [NVIDIA] Remove unused get_fused_moe_impl_class function (#9764) Kaixi Hou 2025-09-05 22:41:22 -07:00
  • 3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) Cheng Wan 2025-09-05 21:09:09 -07:00
  • dbb1235d58 [Fix] illegal sync based on undefined behaviour (#9620) DevashishLal-CB 2025-09-05 20:54:48 -07:00
  • ad26f298e2 fix double sparsity initialization (#6905) Chi-Chih Chang 2025-09-06 11:45:24 +08:00
  • 8d114f254b Fix RMSNorm API CALL mismatch issue. (#10032) sogalin 2025-09-06 11:45:13 +08:00
  • 0e78c63c0e Revert "[1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953)" (#10097) Yineng Zhang 2025-09-05 19:57:53 -07:00
  • 1a3d6f31da Modify ci workflow for auto-partitioning in 2-GPU backend tests (#10029) hzh0425 2025-09-06 10:28:42 +08:00
  • 0b8c5721f1 [HiStorage] Remove delete and clear as necessary methods (#10039) Zhiqiang Xie 2025-09-05 19:27:26 -07:00
  • beac202bfd Add lora_path argument to bench_multiturn.py (#10092) Baizhou Zhang 2025-09-05 19:20:42 -07:00
  • 21b9a4b435 [router] Introduce router integration tests (#10086) Keyang Ru 2025-09-05 18:52:53 -07:00
  • db37422c92 [router] move to mcp sdk instead (#10057) Simo Lin 2025-09-05 21:03:46 -04:00
  • ab62b135c1 support Llama4 with non uniformed intermediate size across layers for… (#10047) gongwei-130 2025-09-05 17:28:15 -07:00
  • 273b28344b [Minor] Refactors KV memory pool (#9842) Xinyuan Tong 2025-09-06 00:06:08 +00:00
  • f84db115b1 Add storage read/write bandwidth logs to monitor kvcache performance (#9965) pansicheng 2025-09-06 07:52:55 +08:00
  • efb0de2c8d Update wave-lang to 3.7.0 and unify Wave kernel buffer options (#10069) jacky.cheng 2025-09-06 07:01:52 +08:00
  • 0f6ac5e21d [Bug Fix] Fix Glm4vVisionBlock norm (#9884) Adam Yanxiao Zhao 2025-09-06 05:20:36 +08:00
  • 2985090084 Update flashinfer to 0.3.1 for B300 support (#10087) hlu1 2025-09-05 13:41:01 -07:00
  • e678cc717d [bugfix]: use correct cache location for cross attention in torch native backend (#8622) Mahmoud Ashraf 2025-09-05 23:39:46 +03:00
  • 4efe844a25 enable aiter gemm_a8w8_bpreshuffle for ptpc gemm (#8555) Morpheus Guo 2025-09-06 03:54:40 +08:00
  • bde73ee43f [router] add rust cache in benchmark ci (#10080) Simo Lin 2025-09-05 12:59:36 -04:00
  • 4f0e28d7fc [router] add rust cache for rust unit test (#10079) Keyang Ru 2025-09-05 09:58:59 -07:00
  • 045ab92dc0 [router] add py binding unit tests to coverage 80% (#10043) Keyang Ru 2025-09-05 08:40:21 -07:00
  • bd7f882142 Support copying tensor from cpu to gpu without using copy engines (#10007) fzyzcjy 2025-09-05 20:07:19 +08:00
  • 5e5c30d9ab Tiny let DeepGEMM scale checks cover more cases (#7182) fzyzcjy 2025-09-05 19:52:32 +08:00
  • 9f00ec44eb Fix and enhance dumper (#8725) fzyzcjy 2025-09-05 19:51:09 +08:00
  • 8e85ee887e Support simple evals in text comparator (#8867) fzyzcjy 2025-09-05 19:50:21 +08:00
  • adf73175d6 Forbid DeepEP racing condition when too many tokens (#9567) fzyzcjy 2025-09-05 19:47:05 +08:00
  • 13705dae06 [Fix] Add speculative_draft_model_revision to server_args (#5255) DevashishLal-CB 2025-09-05 04:45:46 -07:00
  • df97b31f37 Tiny support setting numa nodes for different ranks (#10006) fzyzcjy 2025-09-05 19:01:27 +08:00
  • 339f8eef09 [1/2] Optimizations and refactors about quant kernel (#9534) fzyzcjy 2025-09-05 18:45:08 +08:00
  • afd9f2f560 Fix typo in scheduler (#9934) limingshu 2025-09-05 17:45:27 +08:00
  • f40038fb09 [Vulnerability]feat(conn): set bootstrap server host (#9931) Jimmy 2025-09-05 17:36:17 +08:00
  • bebd0576e5 Integrate trtllm ragged attention for prefill self-attention (#9801) Elfie Guo 2025-09-05 02:18:00 -07:00
  • f98366604b fix MultiTokenizerWrapper name (#10049) Huang Long 2025-09-05 13:39:46 +08:00
  • 8b3b995ac9 [router] fix release workflow to include protobuf (#10055) Chang Su 2025-09-04 22:09:30 -07:00
  • 6e95f5e5bd Simplify Router arguments passing and build it in docker image (#9964) Liangsheng Yin 2025-09-05 12:13:55 +08:00
  • 0e9387a95d fix: update gb200 dep (#10052) Yineng Zhang 2025-09-04 20:30:46 -07:00
  • fa9c82d339 chore: bump v0.5.2rc2 (#10050) Yineng Zhang 2025-09-04 20:07:27 -07:00
  • 918e3d4c27 Fix accuracy drop of dsv3 run in dp enablement (#8677) kk 2025-09-05 07:51:16 +08:00
  • e96973742c Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008) kk 2025-09-05 06:11:22 +08:00
  • 93088b6975 [Hicache] Mooncake API Fix & Test, and Improved Readme (#9951) ykwd 2025-09-05 04:55:39 +08:00
  • 453511acc7 Save memory for expert model parallel (#9957) Cheng Wan 2025-09-04 13:31:47 -07:00
  • d07304870b fix 3fs zerocopy (#9938) pansicheng 2025-09-05 04:24:12 +08:00
  • b32ab0705e metrics: support customer buckets for prompt/generation_tokens_histogram (#9634) Yingchun Lai 2025-09-04 22:22:08 +08:00
  • 75ee00112d [Doc] Fix SGLang tool parser doc (#9886) Huapeng Zhou 2025-09-04 09:52:53 -04:00
  • ec15c8360e Optimize Qwen3-moe model by using flashinfer fused allreduce (#9973) Yuan Luo 2025-09-04 20:48:53 +08:00
  • 106c2b31fb feat(hicache): Add generic hicache ci e2e test and benchmark test (#9846) hzh0425 2025-09-04 20:43:46 +08:00
  • c67569491c Ensure chunked request extension length respects both rem_chunk_tokens and rem_total_tokens limits (#10003) pansicheng 2025-09-04 19:15:26 +08:00
  • 27e8ffed37 [1/N] DP-refactor: move dp balance code into scheduler's mixin class (#10004) Liangsheng Yin 2025-09-04 16:53:58 +08:00
  • 4dbb34fe43 fix: health_generate endpoint in mini_lb (#9997) wxsm 2025-09-04 16:52:28 +08:00
  • 1e18a341e9 [Bugfix] fix pd chat completion protocol for batching support (#10016) Tony Lu 2025-09-04 16:43:16 +08:00
  • 2e28654bed Upload New File maxiao1 2025-09-04 08:34:12 +00:00
  • 3ccce90049 Update README.md maxiao1 2025-09-04 08:14:12 +00:00
  • 909abb58f5 adapt to sglang v0.5.2rc1 on dcu maxiao 2025-09-04 15:56:33 +08:00
  • 2c562fd2d0 Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993) Hubert Lu 2025-09-04 00:48:58 -07:00
  • b648d86216 [Fix] gpt-oss mxfp4 model run failed on ROCm platform (#9994) kk 2025-09-04 13:34:17 +08:00
  • bbf261ae4a [router] fix grpc connection mode detection (#9999) Simo Lin 2025-09-04 00:36:16 -04:00
  • 4f8a982d52 [router] clean up dependency injector to use ctx (#10000) Simo Lin 2025-09-04 00:35:51 -04:00
  • d966b902af [router] move tokenizer, reasoning, tool initialization to server (#9996) Simo Lin 2025-09-03 22:35:13 -04:00
  • de9217334b feat: add gpt oss b200 ci (#9988) Yineng Zhang 2025-09-03 17:26:38 -07:00
  • 397448ebbc [Auto Sync] Update parallel_state.py, few_shot_gsm8k.py (20250903) (#9986) Lianmin Zheng 2025-09-03 16:55:43 -07:00
  • 66d5d0425c Minor update regarding issue #9704 (#9733) Elfie Guo 2025-09-03 16:52:07 -07:00
  • 73179b764a nsys profile output kernel classifier (#9314) Grace Ho 2025-09-03 16:22:33 -07:00
  • 8cbf71dc2d Triton 3.4.0 MoE config for Deepseek TP16 H100 (#9978) Szymon Ożóg 2025-09-03 22:16:16 +02:00
  • 56eb5d0a3d fix swa clear(): rename is_in_free_group to is_not_in_free_group (#9914) Xinyuan Tong 2025-09-03 18:42:12 +00:00
  • 4ed9053ecf Remove mrope position sync (#9460) timmy-feng 2025-09-03 11:40:53 -07:00
  • 5e19b159b0 [router] add chat_template_kwargs in ChatCompletionRequest (#9958) Tony Lu 2025-09-04 01:43:52 +08:00
  • 788b19a532 [router] Add Rerank API Specification (#9906) Frank Fang 2025-09-03 23:30:29 +08:00
  • f78b7fd16d [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953) Yuhao Yao 2025-09-03 18:28:27 +08:00
  • b1fb7e458c [benchmark] add flashinfer_allreduce_fusion benchmark (#9937) Xiaoyu Zhang 2025-09-03 16:31:01 +08:00
  • 1b2ff4fb7f Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959) Yineng Zhang 2025-09-03 00:50:04 -07:00
  • 2c7ca33abb Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" (#9955) Yineng Zhang 2025-09-02 23:49:56 -07:00
  • df397a72e8 [feat] Add P/D attention select for draft model (#9755) Ximingwang-09 2025-09-03 13:47:23 +08:00
  • 5dfcd6c207 add proctitle for tokenizers (#9952) Liangsheng Yin 2025-09-03 13:31:38 +08:00
  • 0dfd54d11d Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671) kk 2025-09-03 13:26:28 +08:00
  • bcbeed714f Qwen FP8/NVFP4 ModelOPT Quantization support (#7912) jingyu-ml 2025-09-02 22:56:03 -05:00
  • cc9a31c662 Update tool_chat_template_deepseekv31.jinja (#9895) 南京小汤包 2025-09-03 11:29:21 +08:00
  • d631290e32 Remove annoying warnings in sgl kernel build (#9905) Lianmin Zheng 2025-09-02 20:18:25 -07:00
  • 37565b7f21 fix(cache): move ongoing_prefetch pop after validation to prevent leak (#9927) JinYan Su 2025-09-03 10:39:34 +08:00
  • 6243c36702 [Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946) Al-Ekram Elahee Hridoy 2025-09-02 20:31:15 -06:00
  • 60e37f8028 Move parsers under a single folder (#9912) Lianmin Zheng 2025-09-02 18:25:04 -07:00
  • 369b143366 [HiCache] Minor fix on file storage backend (#9869) Zhiqiang Xie 2025-09-02 15:52:37 -07:00
  • 03dbf1aa8e [model] support MiniCPM-V 4.0 (#8747) tc-mb 2025-09-03 06:33:03 +08:00
  • 11dcabc545 Grpc client (#9939) Chang Su 2025-09-02 11:47:35 -07:00
  • 4d89389c4f Fix the key passing issue in page first layout. (#9929) hzh0425 2025-09-03 02:30:11 +08:00