Commit Graph

404 Commits

Author SHA1 Message Date
Zhiyu
80b2b3207a Enable native ModelOpt quantization support (3/3) (#10154) 2025-10-21 21:44:29 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Kai-Hsun Chen
c61b0b294c [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702) 2025-10-21 21:25:28 +08:00
    Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Meng, Hengyu
b113c72e7a Init attention backend for Intel XPU (#10656) 2025-10-21 11:41:28 +08:00
    Co-authored-by: guangyey <guangye.yu@intel.com>
    Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
harrisonlimh
c726d44cc7 Recapture cuda graph after model weight update to resolve IMA error (#11780) 2025-10-20 10:50:03 +08:00
Liangsheng Yin
57e25de756 Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" (#11827) 2025-10-19 19:44:06 +08:00
YAMY
80407b0493 Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788) 2025-10-19 11:37:43 +08:00
b8zhong
f9a7d9b3dc support server arg override KV cache to bf16 to avoid slow cases (#11749) 2025-10-19 02:49:48 +08:00
Lianmin Zheng
67e34c56d7 Fix install instructions and pyproject.tomls (#11781) 2025-10-18 01:08:01 -07:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784) 2025-10-17 18:57:54 -07:00
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
ykcombat
f440baa136 [Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540) 2025-10-18 02:35:06 +08:00
Shangming Cai
868403f642 [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912) 2025-10-15 18:59:14 -07:00
    Signed-off-by: Shangming Cai <csmthu@gmail.com>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Lianmin Zheng
cd7e1bd591 Sync code and test CI; rename some env vars (#11686) 2025-10-15 18:37:03 -07:00
Yineng Zhang
91fc5bb5a9 feat: add add_chunked_prefix_cache_attention_backend (#11636) 2025-10-14 21:48:13 -07:00
Xun Sun
a40229f6f8 [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423) 2025-10-14 19:40:54 -07:00
    Co-authored-by: Hank Han <hanhan7630@outlook.com>
    Co-authored-by: Shangming Cai <csmthu@gmail.com>
Baizhou Zhang
c224a4c6cc Fix log for chunked prefix cache (#11624) 2025-10-14 11:49:33 -07:00
Lianmin Zheng
5e3f7e7fa9 Minor: improve sampler & remove unused fields from model_config.py (#11531) 2025-10-13 11:04:44 -07:00
Mick
f35f120d70 fix: fix video input for qwen3-vl (#11442) 2025-10-13 09:30:43 -07:00
Liangsheng Yin
516738b096 Deprecate global_server_args_dict (#11528) 2025-10-13 19:34:43 +08:00
Yi Zhang
a55cf5304a [Feature] Support mamba radix cache v0 (#11214) 2025-10-12 20:57:15 -07:00
    Co-authored-by: hanming-lu <hanming@x.ai>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: thalahors <ericalcaide1@gmail.com>
Cheng Wan
1bdd010291 Revert "Deprecate global_server_args_dict" (#11520) 2025-10-12 17:40:40 -07:00
Liangsheng Yin
1083e7e3df Deprecate global_server_args_dict (#11331) 2025-10-13 01:20:47 +08:00
Yuwei An
4ac8e09df0 Piecewise CUDA Graph Support & Torch Compile Backend (#10062) 2025-10-12 11:55:57 +08:00
    Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Lianmin Zheng
b4408e6098 Revert "fix: fix video input for qwen3-vl" (#11437) 2025-10-10 12:44:40 -07:00
Mick
a1a20b4c7c fix: fix video input for qwen3-vl (#11361) 2025-10-10 04:35:35 -07:00
Lianmin Zheng
9b8ebb2798 move more files under srt/utils (#11285) 2025-10-09 16:46:15 -07:00
Trevor Morris
a4b424c632 [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309) 2025-10-08 23:59:46 -07:00
Netanel Haber
d6837aea4d model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909) 2025-10-09 00:37:38 +08:00
    Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Lifu Huang
edefab0c64 [2/2] Support MHA prefill with FlashAttention 4. (#10937) 2025-10-08 00:54:20 -07:00
    Co-authored-by: Hieu Pham <hyhieu@gmail.com>
YAMY
5a9170d993 Optimize copy_kv_cache for spec decoding (#11126) 2025-10-08 10:43:30 +08:00
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Liangsheng Yin
501dfa6b42 Remove sampling info events and overlap thread file (#11300) 2025-10-07 21:34:25 +08:00
Zhiyu
155cbb51f0 Enable native ModelOpt quantization support (1/3) (#7149) 2025-10-06 13:24:15 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
fzyzcjy
efbc687c28 Support DeepSeek V3.2 Exp (#11061) 2025-10-06 00:24:15 -07:00
    Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
    Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
    Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
    Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
    Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
    Co-authored-by: DarkSharpness <2040703891@qq.com>
    Co-authored-by: hnyls2002 <lsyincs@gmail.com>
    Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
    Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
    Co-authored-by: HAI <hixiao@gmail.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Yuan Luo
590f2da052 [Feat] Support Torch Symm Mem AllReduce (#10571) 2025-10-05 13:55:19 -07:00
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
fzyzcjy
fdc4e1e570 Tiny move files to utils folder (#11166) 2025-10-03 22:40:06 +08:00
Matt Nappo
8c57490210 [Feature] Option to save model weights to CPU when memory saver mode is enabled (#10873) 2025-10-03 16:48:19 +08:00
    Co-authored-by: molocule <34072934+molocule@users.noreply.github.com>
jacky.cheng
b00a0c786f [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (#11161) 2025-10-02 21:19:30 -07:00
Dom Brown
e810077488 Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (#11138) 2025-10-02 16:04:58 -07:00
ilyasch2
083629c235 [model] Add mamba2 and Falcon-H1 support. (#10988) 2025-10-02 19:15:36 +08:00
    Co-authored-by: Younes Belkada <younes.belkada@tii.ae>
    Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
fzyzcjy
2ac453b07f Tiny detect slow ranks (#10508) 2025-10-02 18:00:33 +08:00
li-kesen
2bc61dd194 Remove hybrid_linear_attn attention backend and refactor attention registry (#10816) 2025-09-30 10:16:16 +08:00
    Co-authored-by: Yi Zhang <1109276519@qq.com>
Lianmin Zheng
dda34c2f93 Fix mem fraction static for nightly tests (#11076) 2025-09-29 12:57:41 -07:00
amysaq2023
2bdaf482f9 refactor loading weights from remote instance coding format (#10941) 2025-09-26 15:25:39 -07:00
    Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
ronnie_zheng
e22f3a5ec9 [Ascend] optimize Qwen3 on Ascend (#10574) 2025-09-22 17:18:36 -07:00
    Co-authored-by: c30031083 <chenxu140@huawei.com>
Qiaolin Yu
e2ac7888b8 [2/2] Support deterministic inference for temperature > 0 (#10678) 2025-09-21 19:36:08 -07:00
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Stefan He
86527a4799 [deterministic inference] Move batch invariant pkg to sglang (#10695) 2025-09-21 19:35:14 -07:00
Lifu Huang
08ecd0aa2a [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization (#10592) 2025-09-20 22:47:48 -07:00
Baizhou Zhang
8ecef73f12 [1/2] Support deterministic inference with flashinfer attention backend (#10645) 2025-09-19 23:34:29 -07:00
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Yineng Zhang
60e2a7cead [Auto Sync] Update model_runner.py (20250920) (#10679) 2025-09-19 18:26:54 -07:00
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873) 2025-09-18 16:42:41 -07:00
    Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>