480 Commits

Author SHA1 Message Date
JieXin Liang
9e93ef3f8e [fix] fix illegal mem access and clean up triton attention backend (#4571) 2025-03-20 02:01:52 -07:00
Jinyan Chen
f44db16c8e [Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
2025-03-19 08:16:31 -07:00
JieXin Liang
c0e9a36c5f Optimize Triton decoding kernel for dynamic workload (#4553) 2025-03-18 21:25:38 -07:00
aoshen524
588865f0e0 [Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-18 20:33:07 -07:00
Cheng Wan
3196999f63 Reduce computation and communication in DP attention (#4521) 2025-03-18 13:41:36 -07:00
James Liu
9e0186f352 [Feature] Support EAGLE 3 (#4247) 2025-03-18 07:35:23 -07:00
Yineng Zhang
c787298547 use sgl custom all reduce (#4441) 2025-03-18 00:46:41 -07:00
Ke Bao
45212ce18b Add deepseek v2 torch compile pr test (#4538) 2025-03-18 00:29:24 -07:00
Mick
d373a48c98 fix: second_per_grid_ts should be used to get mrope position (#3682) 2025-03-17 18:12:38 -07:00
Zhiqiang Xie
a98290aea3 Unit test for Hierarchical Caching (#4486) 2025-03-17 17:45:00 -07:00
Lianmin Zheng
82dec1f70b Remove redundant type conversion (#4513) 2025-03-17 05:57:35 -07:00
Lianmin Zheng
5493c3343e Fix data parallel + tensor parallel (#4499) 2025-03-17 05:13:16 -07:00
Wei Wu
91ba98fe50 [Fix] Resolve GPU Memory Leak in update_weights_from_tensor (#4446) 2025-03-17 08:54:30 +00:00
Xihuai Wang
927ca935a7 Constraint Decoding: Tool call with text (#4067) 2025-03-17 01:06:46 -07:00
Stefan He
ef3c2dd08e Support Online Quantization for W8A8 (#4485) 2025-03-17 00:28:56 -07:00
萝卜菜
d6d21640d3 [Feature] Support Deepseek-VL2 (#2798)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-03-16 23:07:59 -07:00
Byron Hsu
8cc300f536 Fix router test (#4483) 2025-03-16 22:49:47 -07:00
Rin Intachuen
d1112d8548 Add endpoint for file support, purely to speed up processing of input_embeds. (#2797) 2025-03-16 18:30:37 -07:00
woodx
48efec7b05 Feature: support code completion (#3612) 2025-03-16 18:26:19 -07:00
Mick
9d02bb3e2a Urgent model support: support gemma-3-it (#4424) 2025-03-16 17:37:32 -07:00
Ying Sheng
1b859295f4 [Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-16 02:48:55 -07:00
Lianmin Zheng
2c4f5ccac1 Fix minor style (#4460) 2025-03-15 21:51:12 -07:00
lukec
21d485f835 Fix test_create_kvindices unit test (#4452) 2025-03-15 16:01:04 -07:00
Lu Changqi
0e0ec70200 Hierarchical Caching supports MLA (#4009)
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-03-13 20:42:14 -07:00
Lianmin Zheng
f0afaf5289 Add a dummy grok test case (#4399) 2025-03-13 15:29:48 -07:00
Qiaolin Yu
85d2365d33 Fix the output of hidden states after HTTP requests (#4269) 2025-03-13 14:54:06 -07:00
Lianmin Zheng
a5a892ffd3 Fix auto merge & add back get_flat_data_by_layer (#4393) 2025-03-13 08:46:25 -07:00
Lianmin Zheng
8e66fbecee Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-03-13 08:23:56 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Lianmin Zheng
d40ee62b5d Update nightly tests (#4352) 2025-03-12 15:36:13 -07:00
Mick
01090e8ac3 model: Support Janus-pro (#3203) 2025-03-12 11:02:11 -07:00
Mick
ff2ce0b86f refactor: move image processors to separate files (#4229) 2025-03-11 12:35:35 -07:00
lukec
dce303e279 linear support deepgemm (#4199)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-11 00:38:37 -07:00
Lianmin Zheng
5524e7d057 Fix nightly eval for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 (#4279) 2025-03-10 16:50:28 -07:00
HandH1998
2ac189edc8 Amd test fp8 (#4261) 2025-03-10 10:12:09 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
Lianmin Zheng
e8a69e4d0c Clean up fp8 support (#4230) 2025-03-09 21:46:35 -07:00
Lianmin Zheng
fbd560028a Auto balance CI tests (#4238) 2025-03-09 21:05:55 -07:00
Lianmin Zheng
730d084f2a Minor style fix for sgl-kernel (#4243) 2025-03-09 20:15:13 -07:00
Baizhou Zhang
9dfafa743c Fix test of flashinfer mla with nextn (#4237) 2025-03-09 12:45:39 -07:00
Baizhou Zhang
9fb48f951f Support nextn for flashinfer mla attention backend (#4218) 2025-03-09 00:01:54 -08:00
Lianmin Zheng
48473684cc Split test_mla.py into two files (#4216) 2025-03-08 15:40:49 -08:00
Lianmin Zheng
2cadd51d11 Test no vllm custom allreduce (#4210) 2025-03-08 05:23:06 -08:00
Lianmin Zheng
08c4d764a5 lazy import attn backends (#4200) 2025-03-08 00:41:35 -08:00
Lianmin Zheng
d4017a6b63 [EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-07 22:12:13 -08:00
HandH1998
c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888)
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
2025-03-06 20:54:52 -08:00
Pan Lyu
361971b859 Add Support for Qwen2-VL Multi-modal Embedding Models (#3694) 2025-03-06 16:46:20 -08:00
Lianmin Zheng
bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-06 06:13:59 -08:00
Lianmin Zheng
fcc2e37f69 Split the __init__ of scheduler as smaller functions. Improve the eagle tests (#4128) 2025-03-06 00:13:20 -08:00