Lianmin Zheng | 2bcfba1b08 | Skip unnecessary penalizer (#1707) | 2024-10-18 17:54:03 -07:00
Lianmin Zheng | 6790240cc3 | Fix unit test order to balance the tasks in CI (#1665) | 2024-10-14 02:01:44 -07:00
Byron Hsu | 862cd265e5 | [engine] support async and streaming (#1614) | 2024-10-11 15:26:25 -07:00
Byron Hsu | 2422de5193 | Support min_tokens in sgl.gen (#1573) | 2024-10-05 21:51:12 -07:00
Byron Hsu | dde8bb16fe | default sampling param should be deepcopied (#1581) | 2024-10-05 17:27:43 -07:00
Byron Hsu | 6bfdb4031d | [Easy] use .text() instead of .text (#1577) | 2024-10-05 11:07:41 -07:00
Ying Sheng | 04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563) (Co-authored-by: hnyls2002 <hnyls2002@gmail.com>) | 2024-10-04 08:51:11 +00:00
Theresa Barton | 2c7d0a5b8b | [Fix] Fix all the Huggingface paths (#1553) | 2024-10-02 10:12:07 -07:00
Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00
Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00
Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00
Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00
Lianmin Zheng | 167591e864 | Better unit tests for adding a new model (#1488) | 2024-09-22 01:50:37 -07:00
Ke Bao | b8ccaf4d73 | Add MLA gsm8k eval (#1484) | 2024-09-21 11:16:13 +08:00
Ke Bao | a68cb201dd | Fix triton head num (#1482) | 2024-09-21 10:25:20 +08:00
Yineng Zhang | a6db88626e | minor: add quant eval compared with base (#1475) | 2024-09-20 01:57:19 +08:00
Lianmin Zheng | 1acccb364a | Fix oom issues with fp8 for llama (#1454) | 2024-09-18 03:45:19 -07:00
Lianmin Zheng | 5e62a6b706 | Add bench_server_latency.py (#1452) | 2024-09-18 00:56:06 -07:00
Lianmin Zheng | 899cf5c438 | Remove deprecated configs (#1431) | 2024-09-15 08:52:18 -07:00
Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00
Lianmin Zheng | 68be2f6d3b | [CI] Include triton backend and online serving benchmark into CI (#1408) | 2024-09-12 21:36:41 -07:00
Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00
Ying Sheng | 689ff588ec | [CI] Return output logprobs in unit test (#1361) | 2024-09-09 13:05:13 -07:00
Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00
Lianmin Zheng | 1e495e0847 | [Fix] Fix select by ensuring each request has at least one token (#1318) | 2024-09-03 06:31:45 -07:00
Yineng Zhang | 2561ed012c | feat: update nightly gsm8k eval (#1304) | 2024-09-03 01:18:41 +10:00
Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00
Yineng Zhang | b1a540ec42 | feat: update GemmaRMSNorm (#1232) | 2024-08-28 22:47:34 +10:00
Yineng Zhang | 66975360e7 | fix: increase max_new_tokens when testing generation models (#1244) | 2024-08-28 22:12:36 +10:00
Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00
Liangsheng Yin | 75ce37f401 | Move sampler into CUDA graph (#1201) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 07:02:50 -07:00
Mingyi | 97589a60a2 | [CI] Parallelize unit tests in CI (#1219) | 2024-08-26 04:54:02 +00:00
Liangsheng Yin | 632d506d0b | minor: improve CI and dependencies (#1212) | 2024-08-26 04:26:31 +00:00
Mingyi | 7514b9f8d3 | [CI] Fix CI (#1217) | 2024-08-26 02:56:42 +00:00
Mingyi | 158e8f1e2d | improve the threshold and ports in tests (#1215) | 2024-08-25 19:02:08 -07:00
Lianmin Zheng | 15f1a49d2d | Update CI workflows (#1210) | 2024-08-25 16:43:07 -07:00
Ying Sheng | 308d024092 | [CI] Fix the issue of unit test hanging (#1211) | 2024-08-25 16:21:37 -07:00
Chayenne | 30b4f771b0 | Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-25 10:29:12 -07:00
Ying Sheng | 1cb4da5c5f | [Fix] the issue of random order when input is a list (#1199) | 2024-08-24 21:43:03 -07:00
Ying Sheng | e61d13acdf | [CI] Fix the problem of hf runner too slow (#1202) | 2024-08-24 18:35:55 -07:00
Lianmin Zheng | f6af3a6561 | Cleanup readme, llava examples, usage examples and nccl init (#1194) | 2024-08-24 08:02:23 -07:00
Yineng Zhang | c9064e6fd9 | feat: use gelu_tanh_and_mul (#1193) | 2024-08-24 01:58:16 -07:00
yichuan~ | b997a18d74 | [Feat]Add support for optional start len of logprobs (#1035) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>, Yineng Zhang <me@zhyncs.com>, Lianmin Zheng <lianminzheng@gmail.com>, Liangsheng Yin <hnyls2002@gmail.com>) | 2024-08-18 23:45:41 -07:00
Lianmin Zheng | 3c1f5a9220 | Fix duplicated imports in hf_transformers_utils.py (#1141) | 2024-08-17 18:03:00 -07:00
Lianmin Zheng | 57d0bd91ec | Improve benchmark (#1140) | 2024-08-17 17:43:23 -07:00
Liangsheng Yin | f624f6a6cc | Fix port conflicts between local CI and runner CI. (#1131) | 2024-08-16 15:12:38 -07:00
Liangsheng Yin | 3694f8f996 | Mixed style of chunked prefill (#1013) | 2024-08-16 09:13:00 +00:00
Liangsheng Yin | 73cf6834f2 | Support stop_token_ids in sglang API (#1092) | 2024-08-15 00:31:39 +00:00
Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00
Liangsheng Yin | a34dd86a7d | Use dtype to control generate (#1082) (Co-authored-by: zhyncs <me@zhyncs.com>) | 2024-08-14 15:58:07 +00:00