Commit Graph

  • 0ec580a86c Fix label PR (#10445) Lianmin Zheng 2025-09-14 20:33:09 -07:00
  • 0b14159fc4 Add reasoning examples for GPT-OSS in Markdown examples (#9626) Vincent Zhong 2025-09-14 23:27:40 -04:00
  • 1489cd6c02 [docs / oneliner] update mmmu docs instruction (#9768) Vincent Zhong 2025-09-14 23:26:39 -04:00
  • fc2c3a3d8e metrics: support customer labels specified in request header (#10143) Yingchun Lai 2025-09-15 11:00:08 +08:00
  • 8f6a175803 Fix label pr for ci (#10441) Lianmin Zheng 2025-09-14 19:48:06 -07:00
  • 010181388c Tiny fix wrong naming (#10437) fzyzcjy 2025-09-15 10:24:41 +08:00
  • 4844fac91d Refactor TopK to ensure readability and extensibility (#9338) Cheng Wan 2025-09-14 19:16:25 -07:00
  • b7d385e812 automatically label pr for ci (#10435) Lianmin Zheng 2025-09-14 19:13:11 -07:00
  • 305c9e8c2d [4/N]DP refactor: support watching mode get_load and shortest queue strategy (#10201) Liangsheng Yin 2025-09-15 10:06:08 +08:00
  • ca63f075b7 Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" (#10432) fzyzcjy 2025-09-15 10:03:27 +08:00
  • f9ee6ae17a [router]: Add Embedding routing logic (#10129) Jintao Zhang 2025-09-15 09:44:35 +08:00
  • dcee42c200 feat: add dsv3 fp4 cutlass moe etp ut (#10433) Yineng Zhang 2025-09-14 18:44:09 -07:00
  • 258d02c86d Fix correction bias undefined behavior for nvfp4 models (#10426) fzyzcjy 2025-09-15 09:41:09 +08:00
  • 60d7beda6b Add split tile size for Triton attention (#10425) Ke Bao 2025-09-15 08:35:49 +08:00
  • 2f8ba6fe82 [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path (#10429) Cheng Wan 2025-09-14 17:34:28 -07:00
  • 7ce6c10eb6 fix: enable cu124 and cu128 build on main push (#10431) Yineng Zhang 2025-09-14 16:19:35 -07:00
  • 55025b9282 fix: use latest flashinfer (#10428) Yineng Zhang 2025-09-14 15:07:14 -07:00
  • 4c21b09074 [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 (#9962) Feng Su 2025-09-15 02:08:02 +08:00
  • 165abeebca Typo: in --enable-custom-logit-processor: agree with cli arg (#10076) 艾力可 2025-09-14 11:27:09 +02:00
  • 21ca4c3afa [PD metrics] Fix some uncompleted PD related metrics (#8627) Yingchun Lai 2025-09-14 17:26:58 +08:00
  • e3cf812f7d Fix sgl-kernel + srt CI (#10419) fzyzcjy 2025-09-14 16:44:47 +08:00
  • 4da5533682 Support profile args in Engine API (#6539) fzyzcjy 2025-09-14 16:21:10 +08:00
  • ac964d2e58 Support global scale in addition to per expert scale for cutedsl moe (#10270) fzyzcjy 2025-09-14 16:17:00 +08:00
  • fa46e2bd40 Support offloading in fp8 (#9948) fzyzcjy 2025-09-14 16:14:28 +08:00
  • b047b553c2 [2/2] Speed up prefill mla attention concat (#10157) fzyzcjy 2025-09-14 16:12:04 +08:00
  • a0f844ed5a Let sgl-kernel changes be tested on srt (#10313) fzyzcjy 2025-09-14 16:09:17 +08:00
  • 2df532ef20 Fix the global scale fix does not support EPLB and improve enabling condition (#10369) fzyzcjy 2025-09-14 16:07:47 +08:00
  • abea9250da Auto determine sgl kernel version in blackwell CI (#10318) fzyzcjy 2025-09-14 16:06:30 +08:00
  • b3c977622f Add h200 fused moe config for Qwen3-Next (#10404) Ximingwang-09 2025-09-14 14:48:46 +08:00
  • b8347b40b1 Add self.capture_aux_hidden_states For GLM-4.5V (#10228) Yuxuan Zhang 2025-09-14 14:31:55 +08:00
  • 72dfa96aeb Fix cutlass moe accuracy drop caused by attention UB from DP padding mode (#10414) fzyzcjy 2025-09-14 13:29:09 +08:00
  • 05b01ef4da fix duplicated logger in eager_utils (#10410) lijin23 2025-09-14 13:25:40 +08:00
  • 55a6e644b0 [Hack] Add pd-disaggregation decode polling interval (#10411) Liangsheng Yin 2025-09-14 10:18:23 +08:00
  • 6897e06b69 Remove repeatedly lists adding in init_incremental_detokenization (#10412) Liangsheng Yin 2025-09-14 10:05:52 +08:00
  • a360511d7b [Generative Score API] Scoring(Prefill-only) optimizations. (#9748) Sundara Raman Ramachandran 2025-09-13 10:57:06 -07:00
  • 94d0f656fb [Performance] Dynamic Batch Tokenizer (#9382) Sundara Raman Ramachandran 2025-09-13 10:56:04 -07:00
  • eca59f96c3 Update ROCm docker image to add sgl-router support (#10406) kk 2025-09-14 01:52:18 +08:00
  • 9752861002 [Fix] Support qwen3-next MTP+DP (#10392) Binyao Jiang 2025-09-13 02:45:04 -07:00
  • 297d374510 support qwen3_next blackwell (#10403) Yi Zhang 2025-09-13 17:18:26 +08:00
  • 81eb07da4d Add new file maxiao1 2025-09-13 09:02:45 +00:00
  • c6f9781382 Update README.md maxiao1 2025-09-13 09:01:40 +00:00
  • 118f1fc726 sglangv0.5.2 & support Qwen3-Next-80B-A3B-Instruct maxiao1 2025-09-13 17:00:20 +08:00
  • 31e9d3a5aa [Fix] Init mamba related memory pools with torch.zeros (#10400) Binyao Jiang 2025-09-12 23:16:48 -07:00
  • 6f4676ef85 fix: tool parse in large streaming chunk beginning with normal content (#10397) Xinyuan Tong 2025-09-12 22:29:35 -07:00
  • c9ec4cae5b Fix the style of sgl kernel (#10398) Lianmin Zheng 2025-09-12 22:20:21 -07:00
  • 99757cc3e6 fix probs name which without temp scaling name (#9984) narutolhy 2025-09-12 21:19:57 -07:00
  • cdddab056c [Auto Sync] Update xgrammar_backend.py (20250913) (#10395) Lianmin Zheng 2025-09-12 17:46:56 -07:00
  • 7c5a0a1b77 [router] add not implemented functions for multi model trait (#10394) Simo Lin 2025-09-12 19:44:18 -04:00
  • 49f169d53e [HiCache] doc: update deployment in readme (#10332) Teng Ma 2025-09-13 07:35:37 +08:00
  • 7fce2fd91a [HiCache] fix mooncake config in different tp size (#10377) Teng Ma 2025-09-13 07:34:23 +08:00
  • 16cd550c85 Support Qwen3-Next on Ascend NPU (#10379) Even Zhou 2025-09-13 07:31:37 +08:00
  • d5e2a37414 Benchmark: Support API_KEY without 'bearer' (#10380) Muqi Li 2025-09-13 07:29:04 +08:00
  • 366043db8e [router] Add get and cancel method for response api (#10387) Keyang Ru 2025-09-12 16:19:38 -07:00
  • 2f173ea074 [router] allow one router to support different model families and serving mode (#10244) Simo Lin 2025-09-12 19:18:27 -04:00
  • 321fecab74 Add sentencepiece to project dependencies (#10386) Mohammad Miadh Angkad 2025-09-13 07:02:54 +08:00
  • 9d775b1a2d feat: add deepseek v3 fp4 ut (#10391) Yineng Zhang 2025-09-12 15:43:29 -07:00
  • 78b7465cad Fix GPU fault issue when run dsv3 with dp mode and enable torch-compile (#10361) kk 2025-09-13 06:05:51 +08:00
  • 07bcad7fb7 [bug] fix router ci syntax error (#10390) Simo Lin 2025-09-12 17:39:15 -04:00
  • 98adac8ef7 fix: exclude protobuf generated code (#10388) Yineng Zhang 2025-09-12 13:52:09 -07:00
  • cef11e9a55 fix: resolve gb200 image link (#10343) Yineng Zhang 2025-09-12 13:46:02 -07:00
  • 2269cf1e2f [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) (#10333) Lianmin Zheng 2025-09-12 12:55:55 -07:00
  • 151e287d1a fix: add fast path for function call (#9023) Yi Zhang 2025-09-13 01:28:54 +08:00
  • 8c86595c93 [router] enable sccache in ci and local build (#10099) Simo Lin 2025-09-12 12:43:48 -04:00
  • 4634fd5953 [router] Add Rerank Routing Logic in Regular Router (#10219) Frank Fang 2025-09-13 00:10:18 +08:00
  • efedbe6ca9 Fix global input scale incompatible with CuTe DSL moe (#10370) fzyzcjy 2025-09-12 18:22:49 +08:00
  • 3a77c80b26 Fix FA4 import cause moe_fused_gate output be illegal memory (#10368) fzyzcjy 2025-09-12 18:21:26 +08:00
  • 36acd2ff16 Fix chunked prefix cache for nvfp4 (#10180) Shu Wang 2025-09-12 05:20:30 -05:00
  • fe6cdf8972 add qwen3-next ut (#10355) Yi Zhang 2025-09-12 18:06:48 +08:00
  • 30d20ce84f Support loading weights from remote instance (#8215) amysaq2023 2025-09-12 17:40:22 +08:00
  • 1b1701f1f7 model: support dots.vlm1 model (#8778) chenge@xiaohongshu.com 2025-09-12 17:38:38 +08:00
  • 6d40308905 Revert add mainprocess's proctitle (#10351) ybyang 2025-09-12 16:48:30 +08:00
  • 24dc2bee97 Fix Bailing MoE model bugs (#10362) Yuan Luo 2025-09-12 15:36:02 +08:00
  • fac07c9b08 Support LingV2 model (#10359) strgrb 2025-09-12 14:53:52 +08:00
  • b3839a7f99 fix: resolve transfer_kv_all_layer_direct_lf_pf import error (#10360) Yineng Zhang 2025-09-11 23:53:23 -07:00
  • 4aa39d72c4 fix the break in FlashInferFusedMoE (#10356) chenqianfzh 2025-09-11 23:47:48 -07:00
  • b4c2c421e9 support memory_pool_host page first direct layout (#10031) huangtingwei 2025-09-12 14:19:44 +08:00
  • 53ca15529a Implement Standalone gRPC Server for SGLang Python Scheduler (#10283) Chang Su 2025-09-11 20:57:17 -07:00
  • a23bdeaf04 [router] Basic OAI Response api (#10346) Keyang Ru 2025-09-11 20:56:17 -07:00
  • 27778010fc fix dual stream bug (#10352) Yi Zhang 2025-09-12 11:53:42 +08:00
  • 46d8fb1c98 model: support Apertus (#9774) EduardDurech 2025-09-12 05:49:10 +02:00
  • c7e85f5378 fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale (#10296) Trevor Morris 2025-09-11 20:19:17 -07:00
  • 3df05f4d6a [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) Shu Wang 2025-09-11 22:18:43 -05:00
  • 7b141f816c [router][ci] Add gpu utilization analyze with nvml (#10345) Keyang Ru 2025-09-11 19:26:02 -07:00
  • 7bc5fb0d78 [CPU][doc] add torch.compile param in example commands (#10349) Zaili Wang 2025-09-12 10:22:46 +08:00
  • 144ee5f37c [Auto Sync] Update server_args.py (20250912) (#10347) Lianmin Zheng 2025-09-11 19:18:07 -07:00
  • b0d25e72c4 chore: bump v0.5.2 (#10221) Yineng Zhang 2025-09-11 16:09:20 -07:00
  • a2424068ec add try catch for quant config hf download (#10340) gongwei-130 2025-09-11 15:00:21 -07:00
  • c5d2b01cea [LongCat] Optimize zero_experts_compute_triton by changing mask (#10303) zk-lover 2025-09-12 05:56:25 +08:00
  • 46ccbed2cd update GLM nightly test threshold (#10331) Minglei Zhu 2025-09-11 14:54:58 -07:00
  • fe68c1486f Fix errors of hicache kernels in sgl-kernel for ROCm (#10339) Hubert Lu 2025-09-11 14:54:34 -07:00
  • 70c0c1f926 fix: trtllm-gen attention take zero-init workspace (#10330) eigen 2025-09-11 17:35:23 -04:00
  • 760b788a58 add qwen3-next doc (#10327) Yi Zhang 2025-09-12 05:29:11 +08:00
  • 1ee11df8ac [router][ci] add gpu process check and free port before start server (#10338) Keyang Ru 2025-09-11 14:24:16 -07:00
  • dee197e11b [router] Add OpenAI backend support - core function (#10254) Keyang Ru 2025-09-11 14:13:51 -07:00
  • ab795ae840 add h20 qwen3 next config (#10264) Yi Zhang 2025-09-12 05:02:24 +08:00
  • 480d1b8b20 [router] add benchmark for regular router and pd router (#10280) Keyang Ru 2025-09-11 12:04:11 -07:00
  • 6c18ab46a2 [Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s (#10335) Stefan He 2025-09-11 11:59:48 -07:00
  • 4a0e0be2a2 [bugfix] fix norm type error in qwen3_next model (#10322) cao1zhg 2025-09-12 00:05:59 +08:00
  • 64f296f8e6 [Minor] Improve the style of server args (#10328) Lianmin Zheng 2025-09-11 07:06:29 -07:00
  • 956d805dde [Auto Sync] Update parallel_state.py (20250911) (#10326) Lianmin Zheng 2025-09-11 06:36:29 -07:00