Commit Graph

4133 Commits

Author SHA1 Message Date
lizhigong
143ec5f36c Adapt W4A8 quantization
(cherry picked from commit 848c5b8290)
2025-10-25 15:07:04 +08:00
lizhigong
67510e0172 Adapt part of W4A8 quantization
(cherry picked from commit 68277eac30)
2025-10-25 15:06:27 +08:00
maxiao
251235c229 Adapt to v0.5.4 2025-10-25 12:16:25 +08:00
sglang-bot
1053e1be17 chore: bump SGLang version to 0.5.4 (#12027)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2025-10-23 18:01:40 -07:00
Roger Young
dbd9435dc1 Fix mamba radix cache eviction logic in alloc_req_slots (#11616)
Signed-off-by: rogeryoungh <rogeryoungh@foxmail.com>
2025-10-23 13:07:43 -07:00
b8zhong
8ae9d4bb41 Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" (#12028) 2025-10-23 12:42:59 -07:00
Nicolas Castet
1c304aa9bc Log iteration # for prefill and decode (#9366) 2025-10-23 12:28:03 -07:00
Mick
770529a731 model: support deepseek-ocr (#11891)
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Shi Shuai <126407087+shuaills@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-10-24 03:15:17 +08:00
ErvinXie
39c237f02c Add AWQ quantization support for NPU. (#10158)
Co-authored-by: Alisehen <814073252@qq.com>
Co-authored-by: Yaochen Han <48639761+Alisehen@users.noreply.github.com>
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
2025-10-23 12:08:05 -07:00
Lianmin Zheng
ab07cd3e5a [Auto Sync] Update test_deterministic_utils.py (20251023) (#12022)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-10-23 11:20:45 -07:00
Netanel Haber
a98496834b Feature/nano v2 offline modelopt fp8 and nvfp4 (#12018)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-10-23 11:16:46 -07:00
Teng Ma
96a5e4dd79 [Feature] Support loading weights from ckpt engine worker (#11755)
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
Co-authored-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-10-23 09:23:30 -07:00
cctry
b0b4f71679 [Fix] memory leak by overlap + retract (#11981)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2025-10-23 22:59:23 +08:00
Liangsheng Yin
6c18addb6f Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" (#12015) 2025-10-23 21:27:58 +08:00
Liangsheng Yin
32852fe9e9 Move memory runtime checker to mixin class (#12014) 2025-10-23 20:53:26 +08:00
Netanel Haber
d6fee73d1f Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 (#11866) 2025-10-23 17:29:02 +08:00
Qiaolin Yu
36a4cad7b0 Support overlap-spec-v2 with trtllm_mla attention backend (#11821) 2025-10-23 16:55:35 +08:00
yinghui
c23eda8589 Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend (#11985) 2025-10-22 22:44:45 -07:00
Jue WANG
138ff23187 Allow to disable batch decoding. (#11944) 2025-10-22 21:57:12 -07:00
blzheng
13fb8b5489 [CPU] Optimize FP16 decode_attention_cpu (#10652) 2025-10-22 21:39:51 -07:00
Zaili Wang
007b849b0e [CPU] misc updates (#11906) 2025-10-22 21:10:05 -07:00
Johnny
e7aa4664b3 [NVIDIA] Build CUDA 13 (#11299)
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-22 20:03:12 -07:00
b8zhong
4d4feccbb2 [ROCm] Remove vLLM rope dependency & use AITER impl (#11322) 2025-10-22 19:17:34 -07:00
jacky.cheng
99c92ff24b [AMD] Support a new flag to disable quant on parallelLinear layer if required (#11811) 2025-10-22 19:16:15 -07:00
Chang Su
6ade6a02d4 [grpc] Support gRPC standard health check (#11955) 2025-10-22 16:59:09 -07:00
Christian Bahls
164302c7df Implement BGE-M3 Sparse Embeddings in SGLang (#10869)
Co-authored-by: Christian Bahls <christian.bahls@planet-ai.de>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-22 13:46:16 -07:00
jiahanc
eec9e471ca [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-10-22 13:11:16 -07:00
Lianmin Zheng
6d535b719f Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980) 2025-10-22 11:50:26 -07:00
yuho
fdcb1d13c5 [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977) 2025-10-22 11:29:55 -07:00
Hongbo Xu
d7e834d6ba [6/n]decouple quantization implementation from vLLM dependency (#10750) 2025-10-23 02:07:55 +08:00
Fan Yin
1d097aac87 [Fix] Remove unused import from triton_kernels_moe.py (#11967)
Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>
2025-10-22 21:02:57 +08:00
996_icu
88568c01eb [model] Support POINTSV15Chat (#9651)
Co-authored-by: josephyou <josephyou@tencent.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: root <root@TENCENT64.site>
2025-10-22 16:58:17 +08:00
Hank Han
904655c5fd [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606)
Co-authored-by: Xun Sun <UNIDY2002@outlook.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-10-22 01:13:31 -07:00
Xun Sun
e028af6998 Fix mooncake dispatcher (#11908) 2025-10-22 01:11:49 -07:00
Zhiyu
80b2b3207a Enable native ModelOpt quantization support (3/3) (#10154)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-10-21 21:44:29 -07:00
Liangsheng Yin
9d61205dac [lint] improve ruff check (#11922)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-22 11:32:50 +08:00
Chang Su
70f6309cd4 [router][grpc] Support v1/responses API (#11926) 2025-10-21 17:41:48 -07:00
Yineng Zhang
704160017d fix: resolve flashinfer 0.4.1 import (#11940) 2025-10-21 17:19:57 -07:00
Yineng Zhang
c461e7714d [Auto Sync] Update forward_batch_info.py (20251021) (#11934)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: yinghui <32845984+cicirori@users.noreply.github.com>
2025-10-21 15:52:15 -07:00
Zheng Wengang
fde2decf8b [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl (#11377) 2025-10-21 15:36:01 -07:00
Yineng Zhang
9792b9d7e3 chore: upgrade flashinfer 0.4.1 (#11933) 2025-10-21 14:46:31 -07:00
Baizhou Zhang
ef4a8097b8 Rename flashmla kernel options of nsa backend for better readability (#11876) 2025-10-21 13:14:16 -07:00
Baizhou Zhang
ebff4ee648 Update sgl-kernel and remove fast hadamard dependency (#11844) 2025-10-21 13:13:54 -07:00
Serge Panev
2b1da821b5 [NVIDIA] Add new SMs support for Spark & Thor (#11287)
Signed-off-by: Serge Panev <spanev@nvidia.com>
2025-10-22 02:02:24 +08:00
Liangsheng Yin
97710ccd1a Fix flush cache API for spec v2 (#11918) 2025-10-21 23:01:16 +08:00
Kai-Hsun Chen
c61b0b294c [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
2025-10-21 21:25:28 +08:00
Vincent Zhong
e8640ee9be [smol] [perf] Inverse perm improvement (#11482)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
2025-10-21 19:18:10 +08:00
b8zhong
d0a64c7e2c vlm: enforce pybase64 for image and str encode/decode (#10700) 2025-10-21 19:05:32 +08:00
Zhengke Zhou
260fe755b6 Simplify multi-tokenizer (#11295)
Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2025-10-21 16:33:29 +08:00
ybyang
dbb16bedd5 Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] (#11416)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: YorkSu <york_su@qq.com>
2025-10-21 16:27:56 +08:00