Chang Su | 6ade6a02d4 | 2025-10-22 16:59:09 -07:00
[grpc] Support gRPC standard health check (#11955)
|
Baizhou Zhang | 983ef22cf3 | 2025-10-22 14:12:15 -07:00
[Doc] Update deterministic inference flag in server_arguments.md (#11978)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Christian Bahls | 164302c7df | 2025-10-22 13:46:16 -07:00
Implement BGE-M3 Sparse Embeddings in SGLang (#10869)
Co-authored-by: Christian Bahls <christian.bahls@planet-ai.de>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Simo Lin | 5dccf69713 | 2025-10-22 13:26:06 -07:00
[router] create worker removal step and clean up worker manager (#11921)
|
jiahanc | eec9e471ca | 2025-10-22 13:11:16 -07:00
[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
|
Lianmin Zheng | 6d535b719f | 2025-10-22 11:50:26 -07:00
Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980)
|
yuho | fdcb1d13c5 | 2025-10-22 11:29:55 -07:00
[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977)
|
Hongbo Xu | d7e834d6ba | 2025-10-23 02:07:55 +08:00
[6/n]decouple quantization implementation from vLLM dependency (#10750)
|
Minglei Zhu | 200a3c0bb1 | 2025-10-22 12:36:15 -05:00
[Documentation] add doc for deterministic inference (#11956)
|
Keyang Ru | 77258ce039 | 2025-10-22 09:27:58 -07:00
[router] Support multiple worker URLs for OpenAI router (#11723)
|
Fan Yin | 1d097aac87 | 2025-10-22 21:02:57 +08:00
[Fix] Remove unused import from triton_kernels_moe.py (#11967)
Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>
|
Shangming Cai | 7fceeef599 | 2025-10-22 21:00:47 +08:00
Fix flaky hicache test with mooncake backend (#11953)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
996_icu | 88568c01eb | 2025-10-22 16:58:17 +08:00
[model] Support POINTSV15Chat (#9651)
Co-authored-by: josephyou <josephyou@tencent.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: root <root@TENCENT64.site>
|
Hank Han | 904655c5fd | 2025-10-22 01:13:31 -07:00
[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606)
Co-authored-by: Xun Sun <UNIDY2002@outlook.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
Xun Sun | e028af6998 | 2025-10-22 01:11:49 -07:00
Fix mooncake dispatcher (#11908)
|
Zhiyu | 80b2b3207a | 2025-10-21 21:44:29 -07:00
Enable native ModelOpt quantization support (3/3) (#10154)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
|
Johnny | 4b65ed42cc | 2025-10-21 21:18:25 -07:00
[NVIDIA] upstream FA4 and fix cccl path (#11929)
|
Fan Yin | 23afdfd1c2 | 2025-10-21 21:17:50 -07:00
[sgl-kernel] support flashmla libtorch (#11717)
|
Liangsheng Yin | 9d61205dac | 2025-10-22 11:32:50 +08:00
[lint] improve ruff check (#11922)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
Chang Su | 590bc4b7a7 | 2025-10-21 18:38:51 -07:00
[router][grpc] Fix background tasks stored with wrong id (#11945)
|
Keyang Ru | 63cfe1b032 | 2025-10-21 17:51:21 -07:00
[router] Add gRPC E2E test suite (#11790)
|
Chang Su | 70f6309cd4 | 2025-10-21 17:41:48 -07:00
[router][grpc] Support v1/responses API (#11926)
|
Yineng Zhang | 704160017d | 2025-10-21 17:19:57 -07:00
fix: resolve flashinfer 0.4.1 import (#11940)
|
Keyang Ru | 87a92e459a | 2025-10-21 16:10:35 -07:00
Fix openai input_text type compatibility (#11935)
|
Yineng Zhang | c461e7714d | 2025-10-21 15:52:15 -07:00
[Auto Sync] Update forward_batch_info.py (20251021) (#11934)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: yinghui <32845984+cicirori@users.noreply.github.com>
|
Zheng Wengang | fde2decf8b | 2025-10-21 15:36:01 -07:00
[BugFix][Qwen3-VL]: add metadata for video in qwen3-vl (#11377)
|
Yineng Zhang | 9792b9d7e3 | 2025-10-21 14:46:31 -07:00
chore: upgrade flashinfer 0.4.1 (#11933)
|
Baizhou Zhang | ef4a8097b8 | 2025-10-21 13:14:16 -07:00
Rename flashmla kernel options of nsa backend for better readability (#11876)
|
Baizhou Zhang | ebff4ee648 | 2025-10-21 13:13:54 -07:00
Update sgl-kernel and remove fast hadamard dependency (#11844)
|
Serge Panev | 2b1da821b5 | 2025-10-22 02:02:24 +08:00
[NVIDIA] Add new SMs support for Spark & Thor (#11287)
Signed-off-by: Serge Panev <spanev@nvidia.com>
|
Liangsheng Yin | 97710ccd1a | 2025-10-21 23:01:16 +08:00
Fix flush cache API for spec v2 (#11918)
|
Shangming Cai | f3cd5d2510 | 2025-10-21 22:28:50 +08:00
[CI] Fix b200 flashinfer installation (#11915)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
Kai-Hsun Chen | c61b0b294c | 2025-10-21 21:25:28 +08:00
[quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
|
Vincent Zhong | e8640ee9be | 2025-10-21 19:18:10 +08:00
[smol] [perf] Inverse perm improvement (#11482)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
|
b8zhong | d0a64c7e2c | 2025-10-21 19:05:32 +08:00
vlm: enforce pybase64 for image and str encode/decode (#10700)
|
Shangming Cai | 05d3667ab9 | 2025-10-21 18:38:35 +08:00
[CI] disable glm4.1v and fix the flashinfer installation (#11902)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
Zhengke Zhou | 260fe755b6 | 2025-10-21 16:33:29 +08:00
Simplify multi-tokenizer (#11295)
Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
|
ybyang | dbb16bedd5 | 2025-10-21 16:27:56 +08:00
Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] (#11416)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: YorkSu <york_su@qq.com>
|
Hank Han | c1e1600373 | 2025-10-21 16:23:34 +08:00
[fix] fix ci uv install dependency (#11895)
|
Neelabh Sinha | 852c0578fd | 2025-10-21 15:44:33 +08:00
[FEATURE] Add OpenAI-Compatible LoRA Adapter Selection (#11570)
|
Atream | 7e6191c098 | 2025-10-21 00:17:02 -07:00
init support for KTransformers Heterogeneous Computing (#11487)
Co-authored-by: Jianwei Dong <1913953267@qq.com>
|
Gaurav Verma | 6f9b66bdda | 2025-10-20 23:11:09 -07:00
[AMD] Update wave-lang to 3.8.0 (#11878)
Signed-off-by: xintin <gaurav.verma@amd.com>
|
Simo Lin | 8a801ee38d | 2025-10-20 21:08:45 -07:00
[router] release router 0.2.1 (#11885)
|
Qiaolin Yu | d9a20fd28a | 2025-10-21 11:42:09 +08:00
Use trtllm_mla decode kernel for draft extend in speculative decoding (#11664)
|
Meng, Hengyu | b113c72e7a | 2025-10-21 11:41:28 +08:00
Init attention backend for Intel XPU (#10656)
Co-authored-by: guangyey <guangye.yu@intel.com>
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
|
zhangdonghao-zdh | fb6cc7b000 | 2025-10-21 10:56:48 +08:00
Fix RotaryEmbedding for fp32 input (#11843)
|
Xiaoyu Zhang | 8374a96e49 | 2025-10-21 10:55:49 +08:00
piecewise cuda graph support qwen3-moe (#11845)
|
Yuan Luo | 74de76c685 | 2025-10-21 10:38:29 +08:00
Revise MRotaryEmbedding's forward (#11859)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: 羽癫 <yudian.zy@antgroup.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
|
Chang Su | 9c0b1eb5ad | 2025-10-20 19:22:17 -07:00
[router][grpc] Fix warm-up random token ids for small models (#11887)
|
Lianmin Zheng | 01f14a7ad2 | 2025-10-20 18:46:56 -07:00
[code move] move pp into a separate mixin (#11838)
|