Commit Graph

1594 Commits

Author SHA1 Message Date
Lianmin Zheng
4fea040ca1 Fix a regression introduced by overlapping KV cache writing (#4375) 2025-03-13 03:49:05 -07:00
Yineng Zhang
6aaeb84872 chore: bump v0.4.4 (#4041) 2025-03-13 02:49:58 -07:00
Yineng Zhang
3623b6a7f5 upgrade sgl-kernel 0.0.5 (#4381) 2025-03-13 02:37:56 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
2025-03-12 22:26:29 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Cheng Wan
2f6bacee03 [moe] fix: correct the cache size in the last chunk (#3679)
Co-authored-by: Abatom <abzhonghua@gmail.com>
2025-03-12 22:22:13 -07:00
Wen Sun
4014804157 Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation (#3814) 2025-03-12 22:12:55 -07:00
David Carreto Fidalgo
f7f88b706c HotFix: json serialization error when using OAI v1/batches endpoint with logprobs (#3896) 2025-03-12 22:04:29 -07:00
yiakwy-xpu-ml-framework-team
18c27131f5 [tools] add fp8 max/min constant in utils (#3959) 2025-03-12 21:44:55 -07:00
YR Chen
ccdd10c84b Move aiohttp into public dependencies (#3980) 2025-03-12 21:42:57 -07:00
vikram singh shekhawat
76f6c0ebf9 Add device detection and count functions to utils. (#3962) 2025-03-12 21:41:50 -07:00
Conghui Tan
6412c5e493 Avoid duplicated request ids in batch APIs (#4026)
Co-authored-by: conghuitan <conghuitan@tencent.com>
2025-03-12 21:38:17 -07:00
AniZpZ
85ef7f64e4 [FIX] fix incorrect output when enable both deepgemm and torch compile (#4359)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
2025-03-12 21:34:09 -07:00
Chen Shengzhi
f1cf6eefbe [Fix] Check the device backend before calling empty_cache function (#4212) 2025-03-12 21:28:48 -07:00
Wang Ran (汪然)
aff79f101f simple bugfix (#4342) 2025-03-12 21:20:18 -07:00
William
56c39a05a2 Remove the choices in --speculative-eagle-topk argument (#4329) 2025-03-12 21:19:16 -07:00
文峰
c550e52f8b Fix scheduler proctitle suffix is None (#4326)
Co-authored-by: wenfeng.wf <wenfeng.wf@alibaba-inc.com>
2025-03-12 19:29:35 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
Lianmin Zheng
d40ee62b5d Update nightly tests (#4352) 2025-03-12 15:36:13 -07:00
Wang Ran (汪然)
91b19949d7 typo: Update http_server.py (#4350) 2025-03-12 15:05:30 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Mick
01090e8ac3 model: Support Janus-pro (#3203) 2025-03-12 11:02:11 -07:00
yych0745
6f43a9b9f4 remove the unused readline dependency from the Qwen2 model implementa… (#4340) 2025-03-12 02:47:27 -07:00
JieXin Liang
0540fef7a1 [Fix] fix _yarn_linear_ramp_mask with device parameter (#4337) 2025-03-12 02:28:19 -07:00
lambert0312
481f608b8e Add INT8 support MTP NextN function (#3911) 2025-03-12 01:37:16 -07:00
Yineng Zhang
ed91561f79 upgrade sgl-kernel 0.0.4.post3 (#4334) 2025-03-12 01:36:41 -07:00
Stefan He
e0917e6bd0 Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215)
Co-authored-by: Stefan He <bhe@linkedin.com>
2025-03-12 00:08:03 -07:00
lambert0312
7140ba3573 Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4323) 2025-03-11 18:25:56 -07:00
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00
Yineng Zhang
1cf63485c1 upgrade flashinfer 0.2.3 (#4317)
Co-authored-by: qingquansong <qsong@linkedin.com>
2025-03-11 15:37:17 -07:00
Mick
ff2ce0b86f refactor: move image processors to separate files (#4229) 2025-03-11 12:35:35 -07:00
Ximingwang-09
0f2a2e3c19 Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) (#4220)
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
2025-03-11 12:32:33 -07:00
yigex
690e1f2371 [AMD] Fix rocm sgl-kernel missing modules error (#4311)
Co-authored-by: yiakwy-xpu-ml-framework-team <leiwang2@amd.com>
2025-03-11 10:35:28 -07:00
yych0745
6a02b32d07 Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4287)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-03-11 00:49:06 -07:00
lukec
dce303e279 linear support deepgemm (#4199)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-11 00:38:37 -07:00
Yineng Zhang
4d27eb9ad1 update sgl-kernel 0.0.4.post2 (#4291) 2025-03-11 00:34:33 -07:00
lambert0312
d3ecd63204 Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) (#4136) 2025-03-11 00:32:25 -07:00
Yineng Zhang
e187a3d595 upgrade xgrammar 0.1.15 (#4275) 2025-03-10 14:53:24 -07:00
HandH1998
2ac189edc8 Amd test fp8 (#4261) 2025-03-10 10:12:09 -07:00
Lianmin Zheng
5a6400eec5 Test no vllm custom allreduce (#4256) 2025-03-10 10:08:25 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
shimin
ac69885056 fix the input_ids is None error (#4144) 2025-03-10 01:38:37 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
DavidChan
4455b26e76 [Bug fixed] fixed the crash when enable the dp-attention on the single card (#3958) 2025-03-10 00:50:34 -07:00
Lianmin Zheng
e8a69e4d0c Clean up fp8 support (#4230) 2025-03-09 21:46:35 -07:00
Lianmin Zheng
fbd560028a Auto balance CI tests (#4238) 2025-03-09 21:05:55 -07:00
Lianmin Zheng
730d084f2a Minor style fix for sgl-kernel (#4243) 2025-03-09 20:15:13 -07:00
Lianmin Zheng
4a05bdfa86 Revert "Check eagle server args" (#4242) 2025-03-09 18:53:33 -07:00
Ying Sheng
34c8898755 Check eagle server args (#4217) 2025-03-09 01:10:43 -08:00