Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
Liangsheng Yin | 55b974f96f | Process image in parallel (#1539) | 2024-09-29 18:52:43 -07:00 |
Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00 |
Xinyu Yang | acaffd233f | [Fix] fix ipv6 url when warm up model (#1537) | 2024-09-29 11:02:40 -07:00 |
Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
Liangsheng Yin | fd9ad817ec | Organize image inputs (#1531) | 2024-09-29 06:28:55 +00:00 |
Lianmin Zheng | e165a9fc1b | Make detokenizer_manager.py not asyncio (#1532) | 2024-09-28 19:33:09 -07:00 |
Lianmin Zheng | 4e4459b91f | Multiple minor fixes (#1530) | 2024-09-28 14:43:35 -07:00 |
Jeffrey Fong | 065bb94753 | Fix RuntimeEndpoint.select method (#1495) | 2024-09-28 14:04:06 -07:00 |
Kylin | f42e9bfb52 | [bugfix] Add modelscope package to avoid docker image without modelscope (#1520) | 2024-09-28 12:43:22 -07:00 |
Ninglin Du | 840c5dbcb3 | [FIX] Catch syntax error of Regex Guide to avoid crash (#1521) | 2024-09-28 12:42:06 -07:00 |
Jerry Zhang | 63e845d0bb | Add float8 dynamic quant to torchao_utils (#1528) | 2024-09-28 12:27:54 -07:00 |
Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00 |
Liangsheng Yin | 4353acb469 | minor: fix config (#1524) | 2024-09-27 01:49:16 -07:00 |
Lianmin Zheng | 9ae1db0bdc | [Fix] Ignore import error (#1513) | 2024-09-25 11:32:21 -07:00 |
Ying Sheng | 37c5899fc2 | Release v0.3.2 (#1512) | 2024-09-25 14:17:09 +08:00 |
Ying Sheng | f39a0197fd | Revert "kernel: use tensor cores for flashinfer gqa kernels" (#1511) | 2024-09-24 22:50:31 -07:00 |
TianyiQ | 3c93187caf | Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) | 2024-09-24 21:50:20 -07:00 |
Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00 |
Lianmin Zheng | 067d8e16fc | Simplify bench_latency.py (#1503) | 2024-09-24 17:42:07 -07:00 |
luzengxiangcn | e6692bf4a5 | debug radixcache stack_overflow (#1499) | 2024-09-24 04:58:01 -07:00 |
Ke Bao | 8d4ed42ad5 | MoE torch compile (#1497) | 2024-09-24 01:46:59 -07:00 |
Lianmin Zheng | 2854a5ea9f | Fix the overhead due to penalizer in bench_latency (#1496) | 2024-09-23 07:38:14 -07:00 |
Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00 |
Ying Sheng | e4780cf839 | [API, Feature] Support response prefill for openai API (#1490) | 2024-09-22 06:46:17 -07:00 |
Lianmin Zheng | 39bb49d156 | Update dockerfile to include datamodel_code_generator (#1492) | 2024-09-22 04:49:16 -07:00 |
wellhowtosay | 2a99993cd9 | Pr fix max workers (#1456); Co-authored-by: baolujia <baolujia@shizhuang-inc.com>; Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-09-22 02:20:26 -07:00 |
Lianmin Zheng | 167591e864 | Better unit tests for adding a new model (#1488) | 2024-09-22 01:50:37 -07:00 |
Yineng Zhang | 82136eb0b5 | chore: bump v0.3.1.post3 (#1483) | 2024-09-21 11:17:45 +08:00 |
Ke Bao | b8ccaf4d73 | Add MLA gsm8k eval (#1484) | 2024-09-21 11:16:13 +08:00 |
Ke Bao | a68cb201dd | Fix triton head num (#1482) | 2024-09-21 10:25:20 +08:00 |
Yineng Zhang | a6db88626e | minor: add quant eval compared with base (#1475) | 2024-09-20 01:57:19 +08:00 |
Yineng Zhang | b4408b0d16 | feat: update linear deps 1/N (#1305) | 2024-09-19 20:53:11 +08:00 |
Lianmin Zheng | 2cd7e181dd | Fix env vars in bench_latency (#1472) | 2024-09-19 03:19:26 -07:00 |
Lianmin Zheng | 5ce55aee15 | Release v0.3.1.post2 (#1470) | 2024-09-19 02:03:38 -07:00 |
Lianmin Zheng | 2d346a57c2 | Fix padding in the cuda graph (#1469) | 2024-09-19 01:52:15 -07:00 |
Lianmin Zheng | 7f24ea95c3 | Fuse top_k and top_k in the sampler (#1457) | 2024-09-18 04:35:35 -07:00 |
Lianmin Zheng | 1acccb364a | Fix oom issues with fp8 for llama (#1454) | 2024-09-18 03:45:19 -07:00 |
HAI | aa2750beb3 | [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) (#1453) | 2024-09-18 02:01:35 -07:00 |
Lianmin Zheng | 5e62a6b706 | Add bench_server_latency.py (#1452) | 2024-09-18 00:56:06 -07:00 |
Xiao Yu | 5752f25eef | Fixed n>1 causing list index out of range with VLM (#1449) | 2024-09-18 00:46:32 -07:00 |
Liangsheng Yin | 7c162fa9c5 | Fix schedule bug (#1451) | 2024-09-17 22:59:32 -07:00 |
Liangsheng Yin | 36078fb247 | fix schedule bug (#1450) | 2024-09-17 16:33:53 -07:00 |
Ke Bao | b3710d2c93 | Fix attention backend (#1448) | 2024-09-17 14:07:53 +00:00 |
Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
Lianmin Zheng | 90a26be31c | Release 0.3.1.post1 (#1445) | 2024-09-17 01:47:31 -07:00 |
Jani Monoses | 1f4b5f770d | Add OLMoE model (#1444) | 2024-09-17 01:14:53 -07:00 |
Ke Bao | 76524b70d1 | Fix torch compile for deepseek-v2 (#1442) | 2024-09-17 00:52:08 -07:00 |
HAI | 3a6e04185b | [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) | 2024-09-17 07:43:52 +00:00 |