1582 Commits

Author SHA1 Message Date
Conghui Tan
6412c5e493 Avoid duplicated request ids in batch APIs (#4026)
Co-authored-by: conghuitan <conghuitan@tencent.com>
2025-03-12 21:38:17 -07:00
AniZpZ
85ef7f64e4 [FIX] fix incorrect output when enable both deepgemm and torch compile (#4359)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
2025-03-12 21:34:09 -07:00
Chen Shengzhi
f1cf6eefbe [Fix] Check the device backend before calling empty_cache function (#4212) 2025-03-12 21:28:48 -07:00
Wang Ran (汪然)
aff79f101f simple bugfix (#4342) 2025-03-12 21:20:18 -07:00
William
56c39a05a2 Remove the choices in --speculative-eagle-topk argument (#4329) 2025-03-12 21:19:16 -07:00
文峰
c550e52f8b Fix scheduler proctitle suffix is None (#4326)
Co-authored-by: wenfeng.wf <wenfeng.wf@alibaba-inc.com>
2025-03-12 19:29:35 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
Lianmin Zheng
d40ee62b5d Update nightly tests (#4352) 2025-03-12 15:36:13 -07:00
Wang Ran (汪然)
91b19949d7 typo: Update http_server.py (#4350) 2025-03-12 15:05:30 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Mick
01090e8ac3 model: Support Janus-pro (#3203) 2025-03-12 11:02:11 -07:00
yych0745
6f43a9b9f4 remove the unused readline dependency from the Qwen2 model implementa… (#4340) 2025-03-12 02:47:27 -07:00
JieXin Liang
0540fef7a1 [Fix] fix _yarn_linear_ramp_mask with device parameter (#4337) 2025-03-12 02:28:19 -07:00
lambert0312
481f608b8e Add INT8 support MTP NextN function (#3911) 2025-03-12 01:37:16 -07:00
Yineng Zhang
ed91561f79 upgrade sgl-kernel 0.0.4.post3 (#4334) 2025-03-12 01:36:41 -07:00
Stefan He
e0917e6bd0 Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215)
Co-authored-by: Stefan He <bhe@linkedin.com>
2025-03-12 00:08:03 -07:00
lambert0312
7140ba3573 Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4323) 2025-03-11 18:25:56 -07:00
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00
Yineng Zhang
1cf63485c1 upgrade flashinfer 0.2.3 (#4317)
Co-authored-by: qingquansong <qsong@linkedin.com>
2025-03-11 15:37:17 -07:00
Mick
ff2ce0b86f refactor: move image processors to separate files (#4229) 2025-03-11 12:35:35 -07:00
Ximingwang-09
0f2a2e3c19 Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) (#4220)
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
2025-03-11 12:32:33 -07:00
yigex
690e1f2371 [AMD] Fix rocm sgl-kernel missing modules error (#4311)
Co-authored-by: yiakwy-xpu-ml-framework-team <leiwang2@amd.com>
2025-03-11 10:35:28 -07:00
yych0745
6a02b32d07 Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4287)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-03-11 00:49:06 -07:00
lukec
dce303e279 linear support deepgemm (#4199)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-11 00:38:37 -07:00
Yineng Zhang
4d27eb9ad1 update sgl-kernel 0.0.4.post2 (#4291) 2025-03-11 00:34:33 -07:00
lambert0312
d3ecd63204 Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) (#4136) 2025-03-11 00:32:25 -07:00
Yineng Zhang
e187a3d595 upgrade xgrammar 0.1.15 (#4275) 2025-03-10 14:53:24 -07:00
HandH1998
2ac189edc8 Amd test fp8 (#4261) 2025-03-10 10:12:09 -07:00
Lianmin Zheng
5a6400eec5 Test no vllm custom allreduce (#4256) 2025-03-10 10:08:25 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
shimin
ac69885056 fix the input_ids is None error (#4144) 2025-03-10 01:38:37 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
DavidChan
4455b26e76 [Bug fixed] fixed the crash when enable the dp-attention on the single card (#3958) 2025-03-10 00:50:34 -07:00
Lianmin Zheng
e8a69e4d0c Clean up fp8 support (#4230) 2025-03-09 21:46:35 -07:00
Lianmin Zheng
fbd560028a Auto balance CI tests (#4238) 2025-03-09 21:05:55 -07:00
Lianmin Zheng
730d084f2a Minor style fix for sgl-kernel (#4243) 2025-03-09 20:15:13 -07:00
Lianmin Zheng
4a05bdfa86 Revert "Check eagle server args" (#4242) 2025-03-09 18:53:33 -07:00
Ying Sheng
34c8898755 Check eagle server args (#4217) 2025-03-09 01:10:43 -08:00
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
Baizhou Zhang
9fb48f951f Support nextn for flashinfer mla attention backend (#4218) 2025-03-09 00:01:54 -08:00
Yineng Zhang
89ccb533ad use sgl-kernel 0.0.4 (#4224) 2025-03-08 23:43:09 -08:00
Lianmin Zheng
1361ab9e03 Lazily import lora backends (#4225) 2025-03-08 23:39:26 -08:00
Lianmin Zheng
8abf74e3c9 Rename files in sgl kernel to avoid nested folder structure (#4213)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-03-08 22:54:51 -08:00
Mingshan
0fe7c13be1 Fix bench_serving flush cache not recognizing OPENAI_API_KEY (#4181)
Signed-off-by: Mingshan <git@brighill.com>
2025-03-08 01:03:38 -08:00
Lianmin Zheng
08c4d764a5 lazy import attn backends (#4200) 2025-03-08 00:41:35 -08:00
Lianmin Zheng
d4017a6b63 [EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-07 22:12:13 -08:00
Lianmin Zheng
d052f4c8a9 New clang format for sgl kernel (#4194) 2025-03-07 20:21:08 -08:00
Ke Bao
20c8119915 Fix eagle hang issue for max_new_tokens=1 (#4185) 2025-03-07 12:11:18 -08:00
Yineng Zhang
eb61f5c9af Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186) 2025-03-07 10:27:52 -08:00
HAI
0beea4503f ROCm: Flex Attention Enablement with custom backends (#4178)
Co-authored-by: linsun12 <linsun12@amd.com>
2025-03-07 04:38:53 -08:00