2300 Commits

Author SHA1 Message Date
Kebe
4a893d142d Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB (#3749)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-03-08 03:01:13 -08:00
Lianmin Zheng
8d323e95e4 Use clang format 18 in pr-test-sgl-kernel.yml (#4203) 2025-03-08 01:28:10 -08:00
Mingshan
0fe7c13be1 Fix bench_serving flush cache not recognizing OPENAI_API_KEY (#4181)
Signed-off-by: Mingshan <git@brighill.com>
2025-03-08 01:03:38 -08:00
Lianmin Zheng
08c4d764a5 lazy import attn backends (#4200) 2025-03-08 00:41:35 -08:00
Yineng Zhang
96d0e37fa7 Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" (#4198) 2025-03-07 22:57:09 -08:00
Rex
90bb2be27e Minor improvement to per_tensor_quant_fp8 (#4197) 2025-03-07 22:52:12 -08:00
lukec
b93ef5e56d Remove the vllm dependency from the moe_align function (#4164)
Co-authored-by: Hongbosherlock <hongbosherlock@gmail.com>
2025-03-07 22:42:16 -08:00
Lianmin Zheng
d4017a6b63 [EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-07 22:12:13 -08:00
Lianmin Zheng
d052f4c8a9 New clang format for sgl kernel (#4194) 2025-03-07 20:21:08 -08:00
saienduri
e1aaa79ac9 Update amd ci docker image to v0.4.3.post4-rocm630. (#4189) 2025-03-07 13:02:02 -08:00
Ke Bao
20c8119915 Fix eagle hang issue for max_new_tokens=1 (#4185) 2025-03-07 12:11:18 -08:00
Yineng Zhang
70866b6f4f use same version for ci and pyproject (#4187) 2025-03-07 10:39:55 -08:00
Yineng Zhang
eb61f5c9af Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186) 2025-03-07 10:27:52 -08:00
HAI
0beea4503f ROCm: Flex Attention Enablement with custom backends (#4178)
Co-authored-by: linsun12 <linsun12@amd.com>
2025-03-07 04:38:53 -08:00
Michael Yao
c827c671f7 [Docs] Improve bullets appearance and grammar (#4174)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-03-07 03:16:25 -08:00
Yineng Zhang
b55a621ffb fix int8 doc link (#4179) 2025-03-07 02:49:19 -08:00
lukec
ffa1b3e318 Add an example of using deepseekv3 int8 sglang. (#4177)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-07 01:56:09 -08:00
Yineng Zhang
7e3bb52705 update release-pypi-kernel 2025-03-07 01:48:47 -08:00
Yineng Zhang
96263f275c chore: bump v0.0.3.post7 for sgl-kernel (#4176) 2025-03-07 01:15:34 -08:00
Zhiqiang Xie
9376ac361d Memory pool fix for upstream change about eagle (#4170) 2025-03-07 00:58:20 -08:00
Yineng Zhang
94a2b9d33e Put utils in ifndef USE_ROCM to fix CI (#4167) (#4168)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-03-07 00:01:17 -08:00
Stefan He
3c3eb374b2 Remove non-existent AMD header include (#4166) 2025-03-06 23:29:30 -08:00
Michael Yao
d557319a8b [Docs] Fix links and grammar issues (#4162) 2025-03-06 23:14:18 -08:00
Stefan He
95085d65e9 [Refactor] Reducing code duplication across FP8 CUDA quantization kernels (#4163) 2025-03-06 22:58:52 -08:00
HandH1998
c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888)
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
2025-03-06 20:54:52 -08:00
Stefan He
63ee26d162 Add sgl_per_token_quant_fp8 (#4089) 2025-03-06 20:53:05 -08:00
Xiaoyu Zhang
ad55f17182 [quant kernel] sgl-kernel support per_tensor_quant fp8 (#3786) 2025-03-06 18:05:43 -08:00
Pan Lyu
361971b859 Add Support for Qwen2-VL Multi-modal Embedding Models (#3694) 2025-03-06 16:46:20 -08:00
HAI
13bc39c5d6 ROCm: enable trillion-parameter MoE models with INT4-FP8 single node (#4152) 2025-03-06 15:33:02 -08:00
Chayenne
9854a18a51 Hot fix small vocal eagle in docs (#4154)
Co-authored-by: ybyang <ybyang7@iflytek.com>
2025-03-06 15:13:26 -08:00
Chayenne
ebddb65aed Docs: add torch compile cache (#4151)
Co-authored-by: ybyang <ybyang7@iflytek.com>
2025-03-06 14:27:09 -08:00
Adarsh Shirawalmath
19fd57bcd7 [docs] fix HF reference script command (#4148) 2025-03-06 13:21:54 -08:00
Lianmin Zheng
9c58e68b4c Release v0.4.3.post4 (#4140) 2025-03-06 12:50:28 -08:00
Oliver Stanley
d03b3467b8 Fix constrained generation errors by adding datasets dependency (#4142) 2025-03-06 12:07:51 -08:00
yinfan98
ab7fba0ece Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant (#4147) 2025-03-06 11:50:07 -08:00
Lianmin Zheng
bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-06 06:13:59 -08:00
Lzhang-hub
3a3918121f fix bench serving bug (#4135) 2025-03-06 05:34:02 -08:00
Lianmin Zheng
800bf018fb Update CODEOWNER (#4138) 2025-03-06 03:42:10 -08:00
kk
b16af90bc3 AMD/ROCm: update base image string (#4137)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: yichiche <yichiche@amd.com>
2025-03-06 03:38:54 -08:00
Lianmin Zheng
98c73d71cb [Minor] make the __init__ function of model_runner.py shorter (#4132) 2025-03-06 01:51:12 -08:00
Lianmin Zheng
fcc2e37f69 Split the __init__ of scheduler as smaller functions. Improve the eagle tests (#4128) 2025-03-06 00:13:20 -08:00
Liu Jinjie
0804dd11a0 remove unused max_jobs in setup_rocm.py (#4126)
Signed-off-by: Jinjie Liu <jinjie.liu@usc.edu>
2025-03-06 00:12:19 -08:00
saienduri
55dc8e4d52 Add tag suffix to nightly docker builds. (#4129) 2025-03-05 23:22:36 -08:00
Ying Sheng
02e9e9f1cf Add codeowners for eagle implementations (#4131)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-03-05 23:16:49 -08:00
simveit
8f0b63139e Docs: improve EAGLE docs (#4038) 2025-03-05 22:40:21 -08:00
samzong
b9b3b098b9 feat: support docs auto live-reload with sphinx-autobuild (#4111)
Signed-off-by: samzong <samzong.lu@gmail.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-05 22:39:34 -08:00
Zhiqiang Xie
aee30630d8 Add a pointer to the real KV cache pool (#4113) 2025-03-05 21:39:07 -08:00
Lianmin Zheng
286e6540a6 Remove prefill-only-one-req (#4117) 2025-03-05 20:58:48 -08:00
Wenxuan Tan
718c391fd7 [Hoxfix] Fix incomplete token_to_kv_pool refactor (#4121) 2025-03-05 19:32:42 -08:00
Yineng Zhang
fc671f66c1 chore: bump v0.4.3.post3 (#4114) 2025-03-05 17:26:10 -08:00