Commit Graph

227 Commits

Author SHA1 Message Date
Byron Hsu
8233cc10fd [PD] Support logprob & Add failure test (#6558) 2025-05-23 14:29:20 -07:00
Lifu Huang
3cf1473a09 Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-17 16:49:18 -07:00
Elfie Guo
6fc9357503 [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. (#5694) 2025-05-16 13:14:07 -07:00
Kiv Chen
5380cd7ea3 model(vlm): pixtral (#5084) 2025-05-13 00:16:10 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-12 00:17:33 -07:00
Lifu Huang
6e2da51561 Replace time.time() to time.perf_counter() for benchmarking. (#6178)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-11 14:32:49 -07:00
shangmingc
31d1f6e7f4 [PD] Add simple unit test for disaggregation feature (#5654)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-11 13:35:27 +08:00
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-11 12:55:00 +08:00
Lianmin Zheng
de167cf5fa Fix request abortion (#6184) 2025-05-10 21:54:46 -07:00
XinyuanTong
e88dd482ed [CI]Add performance CI for VLM (#6038)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-07 19:20:03 -07:00
JieXin Liang
b70957fcf8 [refactor] slightly tidy fp8 module (#5993) 2025-05-07 17:28:24 -07:00
Jinyan Chen
8a828666a3 Add DeepEP to CI PR Test (#5655)
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
2025-05-06 17:36:03 -07:00
mlmz
256c4c2519 fix: correct stream response when enable_thinking is set to false (#5881) 2025-04-30 19:44:37 -07:00
Qiaolin Yu
7bcd8b1cb2 Fix lora batch processing when input lora_path contains None (#5930) 2025-04-30 19:42:42 -07:00
Ying Sheng
11383cec3c [PP] Add pipeline parallelism (#5724) 2025-04-30 18:18:07 -07:00
Qiaolin Yu
58195dd588 [Fix] Unload lora in HF_Runner if needed (#5899) 2025-04-29 20:17:42 -07:00
Lianmin Zheng
849c83a0c0 [CI] test chunked prefill more (#5798) 2025-04-28 10:57:17 -07:00
Lianmin Zheng
a38f6932cc [CI] Fix test case (#5790) 2025-04-27 08:55:35 -07:00
Lianmin Zheng
621e96bf9b [CI] Fix ci tests (#5769) 2025-04-27 07:18:10 -07:00
Lianmin Zheng
35ca04d2fa [CI] fix port conflicts (#5789) 2025-04-27 05:17:44 -07:00
Stefan He
408ba02218 Add Llama 4 to FA3 test (#5509) 2025-04-26 19:49:31 -07:00
Mick
c998d04b46 vlm: enable radix cache for qwen-vl models (#5349)
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
2025-04-23 20:35:05 -07:00
fzyzcjy
453d412cdb Tiny update error hint (#5037) 2025-04-21 00:47:47 -07:00
JieXin Liang
99456bcacb [perf] introduce deep gemm group_gemm_masked as bmm (#5432) 2025-04-20 00:38:27 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
JieXin Liang
bdde237562 [perf] experimental enhance fp8 per-tensor quant (#5370) 2025-04-14 12:35:43 -07:00
tianlian yi
bc92107b03 Support server based rollout in Verlengine (#4848)
Co-authored-by: Jin Pan <jpan236@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>
2025-04-12 10:07:52 -07:00
saienduri
7f875f1293 update grok test (#5171) 2025-04-09 11:09:47 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-04-08 09:11:35 -07:00
Yubo Wang
804d9f2e4c Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) 2025-04-07 23:20:51 -07:00
Yineng Zhang
3289c1207d Update the retry count (#5051) 2025-04-03 17:07:38 -07:00
Xiaoyu Zhang
e9c6ce461d sgl scaled_fp8_quant support output padding (#4861) 2025-04-02 23:53:57 +08:00
Lianmin Zheng
4ede6770cd Fix retract for page size > 1 (#4914) 2025-03-30 02:57:15 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
fzyzcjy
8690c40bb0 Improve stack trace of retry errors (#4845) 2025-03-29 08:21:31 -07:00
Lianmin Zheng
47e6628aae Fix CI tests (#4853) 2025-03-28 00:28:35 -07:00
Pan Lyu
c913ed4046 support clip embedding model (#4506) 2025-03-27 00:18:15 -07:00
fzyzcjy
fa3c9e0668 Fix popen_launch_server wait for 20 minutes when child process exits (#4777) 2025-03-26 00:32:19 -07:00
fzyzcjy
26f07294f1 Warn users when release_memory_occupation is called without memory saver enabled (#4566) 2025-03-26 00:18:14 -07:00
fzyzcjy
15ddd84322 Add retry for flaky tests in CI (#4755) 2025-03-25 16:53:12 -07:00
Stefan He
4c584fc632 Fix circular imports in gptq.py and unblock test explorer (#4736) 2025-03-24 18:07:08 -07:00
Stefan He
5d7edc8e55 Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-03-23 23:28:11 -07:00
Yun Dai
8cd4250401 [quantization] fix channelwise conversion with scalar weight scale (#4596) 2025-03-22 00:47:52 -07:00
aoshen524
588865f0e0 [Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-18 20:33:07 -07:00
JieXin Liang
0212d2e288 [Fix] use torch.inference_mode() instead of torch.no_grad() (#4372) 2025-03-16 22:54:16 -07:00
Byron Hsu
8cc300f536 Fix router test (#4483) 2025-03-16 22:49:47 -07:00