Zhihao Zhang
|
e7bc600304
|
[Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
|
2025-09-18 16:42:41 -07:00 |
|
yuk.igalaxy
|
9a5c42f9ad
|
feat: Add FlexAttention Backend for Efficient Sparse Attention (#9947)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-09-18 11:49:17 -07:00 |
|
Xuchun Shang
|
1ccd59c715
|
[HICache] introduce evict policy (#10190)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
|
2025-09-18 11:10:20 +08:00 |
|
Kevin Xiang Li
|
de28f8e741
|
vlm: remove redundant d2h movement of mm feature tensors (#9987)
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
|
2025-09-17 15:00:39 -07:00 |
|
harrisonlimh
|
14fdd52740
|
feat: add priority based scheduling with priority based request acceptance and preemption (#8746)
|
2025-09-16 17:10:10 -07:00 |
|
cicirori
|
a2f7218a2e
|
support using fa4 on deepseek on blackwell (#9928)
|
2025-09-16 16:16:06 -07:00 |
|
Zaili Wang
|
925dbb3218
|
[CPU] fix CPU backend sel. issue for Llama4 (#10511)
|
2025-09-16 02:57:45 -07:00 |
|
Lifu Huang
|
3f41b48c40
|
[2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (#10286)
|
2025-09-15 16:04:03 -07:00 |
|
Jimmy
|
3795b6a43f
|
fix(server_args): Skip chunked_prefill_size validation when disaggregation mode is decode (#10358)
|
2025-09-15 12:13:35 +08:00 |
|
Yingchun Lai
|
fc2c3a3d8e
|
metrics: support customer labels specified in request header (#10143)
|
2025-09-14 20:00:08 -07:00 |
|
Liangsheng Yin
|
305c9e8c2d
|
[4/N]DP refactor: support watching mode get_load and shortest queue strategy (#10201)
|
2025-09-15 10:06:08 +08:00 |
|
Ke Bao
|
60d7beda6b
|
Add split tile size for Triton attention (#10425)
|
2025-09-14 17:35:49 -07:00 |
|
Feng Su
|
4c21b09074
|
[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 (#9962)
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
|
2025-09-15 02:08:02 +08:00 |
|
Liangsheng Yin
|
55a6e644b0
|
[Hack] Add pd-disaggregation decode polling interval (#10411)
|
2025-09-14 10:18:23 +08:00 |
|
Sundara Raman Ramachandran
|
94d0f656fb
|
[Performance] Dynamic Batch Tokenizer (#9382)
|
2025-09-14 01:56:04 +08:00 |
|
Even Zhou
|
16cd550c85
|
Support Qwen3-Next on Ascend NPU (#10379)
|
2025-09-12 16:31:37 -07:00 |
|
amysaq2023
|
30d20ce84f
|
Support loading weights from remote instance (#8215)
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Co-authored-by: Chayenne <74843776+zhaochenyang20@users.noreply.github.com>
|
2025-09-12 17:40:22 +08:00 |
|
Yuan Luo
|
24dc2bee97
|
Fix Bailing MoE model bugs (#10362)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: 羽癫 <yudian.zy@antgroup.com>
|
2025-09-12 00:36:02 -07:00 |
|
strgrb
|
fac07c9b08
|
Support LingV2 model (#10359)
Co-authored-by: 羽癫 <yudian.zy@antgroup.com>
Co-authored-by: guoyuhong <yuhong.gyh@antgroup.com>
|
2025-09-11 23:53:52 -07:00 |
|
huangtingwei
|
b4c2c421e9
|
support memory_pool_host page first direct layout (#10031)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-09-11 23:19:44 -07:00 |
|
Chang Su
|
53ca15529a
|
Implement Standalone gRPC Server for SGLang Python Scheduler (#10283)
|
2025-09-11 20:57:17 -07:00 |
|
Shu Wang
|
3df05f4d6a
|
[NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199)
|
2025-09-11 20:18:43 -07:00 |
|
Lianmin Zheng
|
144ee5f37c
|
[Auto Sync] Update server_args.py (20250912) (#10347)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
|
2025-09-11 19:18:07 -07:00 |
|
Lianmin Zheng
|
64f296f8e6
|
[Minor] Improve the style of server args (#10328)
|
2025-09-11 07:06:29 -07:00 |
|
Yi Zhang
|
30c6e1f569
|
Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
|
2025-09-11 04:11:49 -07:00 |
|
Pavani Majety
|
21176b0093
|
[Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
|
2025-09-10 12:00:23 -07:00 |
|
Yiyu Liu
|
737d73ed5b
|
Fix: the default choice is wrong for flashinfer mxfp4 moe precision (#10253)
|
2025-09-10 12:10:38 +08:00 |
|
Lianmin Zheng
|
676a7b51bd
|
make --speculative-draft-model an alias of --speculative-draft-model-path (#10246)
|
2025-09-09 19:12:24 -07:00 |
|
Liangsheng Yin
|
83d55ac51f
|
[1/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. (#10169)
|
2025-09-09 12:27:55 +08:00 |
|
Baizhou Zhang
|
8ad700f735
|
Cleaning codes for speculative attention mode (#10149)
|
2025-09-08 17:38:06 -07:00 |
|
Yineng Zhang
|
b7d1f17b8d
|
Revert "enable auto-round quantization model (#6226)" (#10148)
|
2025-09-07 22:31:11 -07:00 |
|
Weiwei
|
c8295d2353
|
enable auto-round quantization model (#6226)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
|
2025-09-07 22:05:35 -07:00 |
|
cicirori
|
8c5930f08a
|
Add speculator attention backend switch (#9981)
|
2025-09-07 21:44:36 -07:00 |
|
Qiaolin Yu
|
8cda5a622c
|
Standalone speculative decoding (#10090)
|
2025-09-07 20:55:09 -07:00 |
|
Yuwei An
|
9a7ced4e4d
|
[Feature] LMCache Connector Integration (#9741)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-09-06 20:14:55 -07:00 |
|
Lianmin Zheng
|
f3b6760213
|
[Auto Sync] Update server_args.py (20250906) (#10117)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
|
2025-09-06 16:59:36 -07:00 |
|
DevashishLal-CB
|
13705dae06
|
[Fix] Add speculative_draft_model_revision to server_args (#5255)
Signed-off-by: Devashish Lal <devashish@rivosinc.com>
|
2025-09-05 19:45:46 +08:00 |
|
fzyzcjy
|
df97b31f37
|
Tiny support setting numa nodes for different ranks (#10006)
|
2025-09-05 19:01:27 +08:00 |
|
Liangsheng Yin
|
6e95f5e5bd
|
Simplify Router arguments passing and build it in docker image (#9964)
|
2025-09-05 12:13:55 +08:00 |
|
Yingchun Lai
|
b32ab0705e
|
metrics: support customer buckets for prompt/generation_tokens_histogram (#9634)
|
2025-09-04 22:22:08 +08:00 |
|
Hubert Lu
|
2c562fd2d0
|
Fix Llama 4 with MXFP4 dynamic quant on MI35x (#9993)
|
2025-09-04 00:48:58 -07:00 |
|
Yineng Zhang
|
2c7ca33abb
|
Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" (#9955)
|
2025-09-02 23:49:56 -07:00 |
|
Al-Ekram Elahee Hridoy
|
6243c36702
|
[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)
|
2025-09-02 19:31:15 -07:00 |
|
Lianmin Zheng
|
60e37f8028
|
Move parsers under a single folder (#9912)
|
2025-09-02 18:25:04 -07:00 |
|
ybyang
|
5f77e1292d
|
Support Multi Process Tokenizer Manager(#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-09-01 01:00:13 -07:00 |
|
hzh0425
|
c2a26e725c
|
feature(eplb): add min-rebalancing-utilization-threshold for eplb (#8345)
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-08-30 11:24:29 +08:00 |
|
Liangsheng Yin
|
a23c30205d
|
Raise error when topk>1 and page>1 for paged attention backends. (#9784)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-29 12:47:34 +08:00 |
|
Zhiqiang Xie
|
001f51940a
|
[HiCache] change the default policy to write through (#9772)
|
2025-08-28 18:28:39 -07:00 |
|
Lianmin Zheng
|
fce7ae33f8
|
[Sync] Update server_args.py (20250828) (#9745)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
2025-08-28 10:33:00 -07:00 |
|
Lianmin Zheng
|
fd71b11b1d
|
move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679)
|
2025-08-27 03:34:29 -07:00 |
|