Author | Commit | Message | Date
Lianmin Zheng | 8f2c522aba | Improve benchmark scripts and error message printing (#2922) | 2025-01-16 06:24:31 -08:00
yizhang2077 | 767c9dec03 | adapt custom allreduce for tensorrt llm (#2511) | 2025-01-16 04:57:35 +08:00
Ke Bao | bfbda62c8b | Add ut for w8a8 int8 quantization (#2897) | 2025-01-15 18:29:14 +08:00
fzyzcjy | 923f518337 | CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) | 2025-01-13 11:38:51 -08:00
Lianmin Zheng | 6249e4a19e | Revert "Integration of TurboMind AWQ" (#2866) | 2025-01-13 04:44:39 -08:00
bjmsong | 17de02f98d | Integration of TurboMind AWQ (#2828); Co-authored-by: root <bjmsong@126.com> | 2025-01-13 20:14:16 +08:00
Lianmin Zheng | 51ab3ccf47 | Collect more metrics: num_requests_total (#2859) | 2025-01-13 03:57:39 -08:00
Lianmin Zheng | 67008f4b32 | Use only one GPU for MLA CI tests (#2858) | 2025-01-13 03:55:33 -08:00
Lianmin Zheng | 72c7776355 | Fix linear.py and improve weight loading (#2851); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-01-13 01:39:14 -08:00
bjmsong | 0bb0f76311 | Support FP8 E4M3 KV Cache (#2786); Co-authored-by: root <bjmsong@126.com> | 2025-01-12 21:17:11 -08:00
Shi Shuai | c4f9707e16 | Improve: Token-In Token-Out Usage for RLHF (#2843) | 2025-01-11 15:14:26 -08:00
Lianmin Zheng | f1769586d6 | Update threshold in test_nightly_gsm8k_eval.py (#2836) | 2025-01-10 20:37:34 -08:00
justdoit | a47bf39123 | [Eagle2] Fix multiple concurrent request crashes (#2730) | 2025-01-10 14:00:43 -08:00
Chang Su | f290bd4332 | [Bugfix] Fix embedding model hangs with --enable-metrics (#2822) | 2025-01-10 13:14:51 -08:00
JJJJOHNSON | 694e41925e | [eagle2] fix end check when target model verify (#2723) | 2025-01-07 21:46:02 -08:00
Lianmin Zheng | b22f3f6475 | Fix nightly accuracy tests (#2780) | 2025-01-07 21:02:35 -08:00
Lianmin Zheng | 6fb5768372 | Disable math eval on nightly CI temporarily (#2779) | 2025-01-07 18:17:34 -08:00
libra | bdb3929dbb | Refactor SchedulePolicy to improve code organization (#2571) | 2025-01-04 00:05:16 +08:00
Lianmin Zheng | 0f9cc6d8d3 | Fix package loss for small models (#2717); Co-authored-by: sdli1995 <mmlmonkey@163.com> | 2025-01-02 18:25:26 -08:00
Shi Shuai | dd2e2d275f | Docs: Update documentation workflow and contribution guide (#2704); Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-01-02 09:18:31 -08:00
yukavio | 815dce0554 | Eagle speculative decoding part 4: Add EAGLE2 worker (#2150); Co-authored-by: kavioyu <kavioyu@tencent.com>; Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2025-01-02 03:22:34 -08:00
fzyzcjy | 9183c23eca | Speed up update_weights_from_tensor (#2695) | 2025-01-02 02:05:19 -08:00
Xiaotong Jiang | a4d6d6f1dd | [feat]: Add math eval to CI nightly run (#2663); Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-01-01 15:29:35 -08:00
Lianmin Zheng | 21ec66e59e | Minor follow-up fixes for the logprob refactor (#2670) | 2024-12-30 05:42:08 -08:00
Lianmin Zheng | 9c6ba2484f | Refactor logprob computation to return the real logprob used in sampling (#2664) | 2024-12-30 04:51:38 -08:00
Lianmin Zheng | 3231817861 | Revert "[feat] Add math eval to CI" (#2656) | 2024-12-30 15:05:50 +08:00
Xiaotong Jiang | a11f8d5f6a | [feat] Add math eval to CI (#2652) | 2024-12-30 14:49:41 +08:00
Chayenne | 1703d766d8 | CI: skip special token for engine token ids unit test (#2648) | 2024-12-29 13:52:50 -08:00
Shi Shuai | fad29f7f52 | CI: Fix unittest for engine input token ids and output token ids (#2646) | 2024-12-29 13:28:59 -08:00
Shi Shuai | 35bdb48557 | [Feature] Get Token IDs with Engine.generate() (#2636); Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2024-12-29 12:28:27 -08:00
Ying Sheng | e0e09fceeb | [Session] Update session control interface (#2635) | 2024-12-29 02:10:27 -08:00
Tanjiro | 8ee9a8501a | [Feature] Function Calling (#2544); Co-authored-by: Haoyu Wang <120358163+HaoyuWang4188@users.noreply.github.com> | 2024-12-28 21:58:52 -08:00
fzyzcjy | fd28640dc5 | Add update_weights_from_tensor (#2631) | 2024-12-28 13:30:27 -08:00
Lianmin Zheng | 855d0ba381 | [CI] Fix nightly test and raise better error message (#2626); Co-authored-by: Sangbin <rkooo567@gmail.com> | 2024-12-27 22:16:39 -08:00
Lianmin Zheng | dc3bee4815 | Fix test and benchmark scripts (#2598) | 2024-12-26 07:56:26 -08:00
Zhizhou Sha | a74d194146 | [unittest] add unit test to test quant args of srt engine (#2574) | 2024-12-26 06:54:43 -08:00
Adarsh Shirawalmath | acb340728c | [Feature] Support new parameter - EBNF in xgrammar (#2526) | 2024-12-26 05:12:41 -08:00
Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00
Lianmin Zheng | 8496701934 | [Misc] Fix metrics, weight update lock, request logging (#2543) | 2024-12-22 06:27:22 -08:00
Lianmin Zheng | 9cd9dc83b3 | Temporarily disable unit test of torch native attention backend (#2492) | 2024-12-16 14:17:27 -08:00
Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00
Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00
Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00
Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00
Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00
Lianmin Zheng | a6ca736c8e | Simplify stream_output (#2398) | 2024-12-08 12:27:13 -08:00
Yineng Zhang | f62055b528 | minor: add random flashinfer vs triton use case (#2409) | 2024-12-09 04:15:21 +08:00
Yineng Zhang | 74bc9184c3 | minor: add random use case (#2408) | 2024-12-09 03:21:35 +08:00
Yineng Zhang | 0f8eb15323 | feat: support custom task runner (#2407) | 2024-12-09 02:29:55 +08:00
Yineng Zhang | 67470bbb28 | minor: update correct measurement unit (#2406) | 2024-12-08 20:55:04 +08:00