Byron Hsu | 8233cc10fd | [PD] Support logprob & Add failure test (#6558) | 2025-05-23 14:29:20 -07:00 |
Lifu Huang | 3cf1473a09 | Use monotonic clock for interval measurement (#6211) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com> | 2025-05-17 16:49:18 -07:00 |
Elfie Guo | 6fc9357503 | [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. (#5694) | 2025-05-16 13:14:07 -07:00 |
Kiv Chen | 5380cd7ea3 | model(vlm): pixtral (#5084) | 2025-05-13 00:16:10 -07:00 |
Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
applesaucethebun | d738ab52f8 | fix some typos (#6209) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00 |
Lianmin Zheng | fba8eccd7e | Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201) Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 00:17:33 -07:00 |
Lifu Huang | 6e2da51561 | Replace time.time() to time.perf_counter() for benchmarking. (#6178) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com> | 2025-05-11 14:32:49 -07:00 |
shangmingc | 31d1f6e7f4 | [PD] Add simple unit test for disaggregation feature (#5654) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-05-11 13:35:27 +08:00 |
applesaucethebun | 2ce8793519 | Add typo checker in pre-commit (#6179) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-11 12:55:00 +08:00 |
Lianmin Zheng | de167cf5fa | Fix request abortion (#6184) | 2025-05-10 21:54:46 -07:00 |
XinyuanTong | e88dd482ed | [CI]Add performance CI for VLM (#6038) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-05-07 19:20:03 -07:00 |
JieXin Liang | b70957fcf8 | [refactor] slightly tidy fp8 module (#5993) | 2025-05-07 17:28:24 -07:00 |
Jinyan Chen | 8a828666a3 | Add DeepEP to CI PR Test (#5655) Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> | 2025-05-06 17:36:03 -07:00 |
mlmz | 256c4c2519 | fix: correct stream response when enable_thinking is set to false (#5881) | 2025-04-30 19:44:37 -07:00 |
Qiaolin Yu | 7bcd8b1cb2 | Fix lora batch processing when input lora_path contains None (#5930) | 2025-04-30 19:42:42 -07:00 |
Ying Sheng | 11383cec3c | [PP] Add pipeline parallelism (#5724) | 2025-04-30 18:18:07 -07:00 |
Qiaolin Yu | 58195dd588 | [Fix] Unload lora in HF_Runner if needed (#5899) | 2025-04-29 20:17:42 -07:00 |
Lianmin Zheng | 849c83a0c0 | [CI] test chunked prefill more (#5798) | 2025-04-28 10:57:17 -07:00 |
Lianmin Zheng | a38f6932cc | [CI] Fix test case (#5790) | 2025-04-27 08:55:35 -07:00 |
Lianmin Zheng | 621e96bf9b | [CI] Fix ci tests (#5769) | 2025-04-27 07:18:10 -07:00 |
Lianmin Zheng | 35ca04d2fa | [CI] fix port conflicts (#5789) | 2025-04-27 05:17:44 -07:00 |
Stefan He | 408ba02218 | Add Llama 4 to FA3 test (#5509) | 2025-04-26 19:49:31 -07:00 |
Mick | c998d04b46 | vlm: enable radix cache for qwen-vl models (#5349) Co-authored-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-04-23 20:35:05 -07:00 |
fzyzcjy | 453d412cdb | Tiny update error hint (#5037) | 2025-04-21 00:47:47 -07:00 |
JieXin Liang | 99456bcacb | [perf] introduce deep gemm group_gemm_masked as bmm (#5432) | 2025-04-20 00:38:27 -07:00 |
woodx | 3bface15e6 | Feat/support encoder model (like bert) (#4887) | 2025-04-17 01:50:48 -07:00 |
Lianmin Zheng | 177320a582 | Clean up imports (#5467) | 2025-04-16 15:26:49 -07:00 |
Baizhou Zhang | a42736bbb8 | Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) | 2025-04-15 22:01:22 -07:00 |
JieXin Liang | bdde237562 | [perf] experimental enhance fp8 per-tensor quant (#5370) | 2025-04-14 12:35:43 -07:00 |
tianlian yi | bc92107b03 | Support server based rollout in Verlengine (#4848) Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> | 2025-04-12 10:07:52 -07:00 |
saienduri | 7f875f1293 | update grok test (#5171) | 2025-04-09 11:09:47 -07:00 |
Yun Dai | 2695ab0537 | Fix loading KV quantization scale; Enable modelopt kv cache (#4686) Co-authored-by: qingquansong <ustcsqq@gmail.com> | 2025-04-08 09:11:35 -07:00 |
Yubo Wang | 804d9f2e4c | Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) | 2025-04-07 23:20:51 -07:00 |
Yineng Zhang | 3289c1207d | Update the retry count (#5051) | 2025-04-03 17:07:38 -07:00 |
Xiaoyu Zhang | e9c6ce461d | sgl scaled_fp8_quant support output padding (#4861) | 2025-04-02 23:53:57 +08:00 |
Lianmin Zheng | 4ede6770cd | Fix retract for page size > 1 (#4914) | 2025-03-30 02:57:15 -07:00 |
Lianmin Zheng | b26bc86b36 | Support page size > 1 + eagle (#4908) | 2025-03-30 00:46:23 -07:00 |
fzyzcjy | 8690c40bb0 | Improve stack trace of retry errors (#4845) | 2025-03-29 08:21:31 -07:00 |
Lianmin Zheng | 47e6628aae | Fix CI tests (#4853) | 2025-03-28 00:28:35 -07:00 |
Pan Lyu | c913ed4046 | support clip embedding model (#4506) | 2025-03-27 00:18:15 -07:00 |
fzyzcjy | fa3c9e0668 | Fix popen_launch_server wait for 20 minutes when child process exits (#4777) | 2025-03-26 00:32:19 -07:00 |
fzyzcjy | 26f07294f1 | Warn users when release_memory_occupation is called without memory saver enabled (#4566) | 2025-03-26 00:18:14 -07:00 |
fzyzcjy | 15ddd84322 | Add retry for flaky tests in CI (#4755) | 2025-03-25 16:53:12 -07:00 |
Stefan He | 4c584fc632 | Fix circular imports in gptq.py and unblock test explorer (#4736) | 2025-03-24 18:07:08 -07:00 |
Stefan He | 5d7edc8e55 | Support FA3 as Attention backend by using --attention-backend fa3 (#4680) Co-authored-by: qsong <qsong@linkedin.com> Co-authored-by: qingquansong <ustcsqq@gmail.com> | 2025-03-23 23:28:11 -07:00 |
Yun Dai | 8cd4250401 | [quantization] fix channelwise conversion with scalar weight scale (#4596) | 2025-03-22 00:47:52 -07:00 |
aoshen524 | 588865f0e0 | [Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274) Co-authored-by: ShenAo1111 <1377693092@qq.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-03-18 20:33:07 -07:00 |
JieXin Liang | 0212d2e288 | [Fix] use torch.inference_mode() instead of torch.no_grad() (#4372) | 2025-03-16 22:54:16 -07:00 |
Byron Hsu | 8cc300f536 | Fix router test (#4483) | 2025-03-16 22:49:47 -07:00 |