2404 Commits

Author SHA1 Message Date
Lianmin Zheng
a5a892ffd3 Fix auto merge & add back get_flat_data_by_layer (#4393) 2025-03-13 08:46:25 -07:00
Lianmin Zheng
8e66fbecee Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-03-13 08:23:56 -07:00
Lianmin Zheng
f141298a3c Update ci_install_dependency.sh to use accelerate 1.4.0 (#4392)
Co-authored-by: wangyu <wangyu.steph@bytedance.com>
Co-authored-by: wangyu <yuwangauto@foxmail.com>
2025-03-13 07:16:11 -07:00
Lianmin Zheng
4fea040ca1 Fix a regression introduced by overlapping KV cache writing (#4375) 2025-03-13 03:49:05 -07:00
Yineng Zhang
6aaeb84872 chore: bump v0.4.4 (#4041) 2025-03-13 02:49:58 -07:00
Yineng Zhang
3623b6a7f5 upgrade sgl-kernel 0.0.5 (#4381) 2025-03-13 02:37:56 -07:00
Yineng Zhang
4ff1264201 Update pyproject.toml 2025-03-13 02:16:51 -07:00
Yineng Zhang
2a4cbad8e9 bump 0.0.5 sgl-kernel (#4377) 2025-03-13 02:08:35 -07:00
Yineng Zhang
2937387a50 fix accuracy issue (#4376) 2025-03-13 02:06:22 -07:00
yuhui
cf721fdece Update grafana.json (#4374) 2025-03-13 01:31:33 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
2025-03-12 22:26:29 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Cheng Wan
2f6bacee03 [moe] fix: correct the cache size in the last chunk (#3679)
Co-authored-by: Abatom <abzhonghua@gmail.com>
2025-03-12 22:22:13 -07:00
Wen Sun
4014804157 Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation (#3814) 2025-03-12 22:12:55 -07:00
yang_zcybb
ad46550d25 [Doc] Fix typo in backend/sampling_params (#3835)
Co-authored-by: yangzhice.124 <yangzhice.124@bytedance.com>
2025-03-12 22:12:14 -07:00
Jun Liu
14344caa38 [docs] Update outdated description about torch.compile (#3844) 2025-03-12 22:09:38 -07:00
David Carreto Fidalgo
f7f88b706c HotFix: json serialization error when using OAI v1/batches endpoint with logprobs (#3896) 2025-03-12 22:04:29 -07:00
yiakwy-xpu-ml-framework-team
18c27131f5 [tools] add fp8 max/min constant in utils (#3959) 2025-03-12 21:44:55 -07:00
YR Chen
ccdd10c84b Move aiohttp into public dependencies (#3980) 2025-03-12 21:42:57 -07:00
vikram singh shekhawat
76f6c0ebf9 Add device detection and count functions to utils. (#3962) 2025-03-12 21:41:50 -07:00
Chitsing KUI
959a3143fc example: add async offline inference demo (#3961)
Signed-off-by: joeshikui <joeshikui@tencent.com>
Co-authored-by: joeshikui <joeshikui@tencent.com>
2025-03-12 21:41:21 -07:00
Conghui Tan
6412c5e493 Avoid duplicated request ids in batch APIs (#4026)
Co-authored-by: conghuitan <conghuitan@tencent.com>
2025-03-12 21:38:17 -07:00
laixin
0c02086015 add INT8 example into dsv3 README (#4079) 2025-03-12 21:37:30 -07:00
AniZpZ
85ef7f64e4 [FIX] fix incorrect output when enable both deepgemm and torch compile (#4359)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
2025-03-12 21:34:09 -07:00
Chen Shengzhi
f1cf6eefbe [Fix] Check the device backend before calling empty_cache function (#4212) 2025-03-12 21:28:48 -07:00
William
0a59a4657a Fix the doc of FR-Spec (#4295) 2025-03-12 21:22:50 -07:00
Wang Ran (汪然)
aff79f101f simple bugfix (#4342) 2025-03-12 21:20:18 -07:00
Peter Pan
016033188c docs: add parameter --log-requests-level (#4335) 2025-03-12 21:19:37 -07:00
William
56c39a05a2 Remove the choices in --speculative-eagle-topk argument (#4329) 2025-03-12 21:19:16 -07:00
Qingquan Song
4068e01292 Fix per token fp8 quant precision (#4362) 2025-03-12 21:19:05 -07:00
Shi Shuai
817d43705c feat: support ep size < 32 for sgl kernel (#4348) 2025-03-12 20:50:46 -07:00
文峰
c550e52f8b Fix scheduler proctitle suffix is None (#4326)
Co-authored-by: wenfeng.wf <wenfeng.wf@alibaba-inc.com>
2025-03-12 19:29:35 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
shizhediao
2c3656f276 [Fix Doc.] Enable internal forwarding when starting the router (#4355) 2025-03-12 15:53:26 -07:00
Lianmin Zheng
d40ee62b5d Update nightly tests (#4352) 2025-03-12 15:36:13 -07:00
Wang Ran (汪然)
91b19949d7 typo: Update http_server.py (#4350) 2025-03-12 15:05:30 -07:00
Elfie Guo
7c86671131 Support Blackwell Block Scale FP8 Gemm (#4278) 2025-03-12 14:17:11 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Mick
01090e8ac3 model: Support Janus-pro (#3203) 2025-03-12 11:02:11 -07:00
yych0745
6f43a9b9f4 remove the unused readline dependency from the Qwen2 model implementa… (#4340) 2025-03-12 02:47:27 -07:00
JieXin Liang
0540fef7a1 [Fix] fix _yarn_linear_ramp_mask with device parameter (#4337) 2025-03-12 02:28:19 -07:00
lambert0312
481f608b8e Add INT8 support MTP NextN function (#3911) 2025-03-12 01:37:16 -07:00
Yineng Zhang
ed91561f79 upgrade sgl-kernel 0.0.4.post3 (#4334) 2025-03-12 01:36:41 -07:00
Yineng Zhang
6e7239f912 release 0.0.4.post3 sgl-kernel (#4331) 2025-03-12 01:05:16 -07:00
Yineng Zhang
0a3960f21f fix awq_dequantize (#4333) 2025-03-12 01:04:38 -07:00
Rex
07f944631e Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) 2025-03-12 00:10:02 -07:00
Stefan He
e0917e6bd0 Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215)
Co-authored-by: Stefan He <bhe@linkedin.com>
2025-03-12 00:08:03 -07:00
Xiaoyu Zhang
7130a7cea9 refine sgl_moe_align_block_size_benchmark (#4327) 2025-03-11 22:48:38 -07:00
Michael Yao
8f1f614ee2 [Docs] Clean up benchmark_and_profiling.md (#4297)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-03-11 21:48:21 -07:00