| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Byron Hsu | 521f862d90 | Fix runtime.generate when sampling param is not passed (#1582) | 2024-10-05 17:59:05 -07:00 |
| Byron Hsu | 34c32d2820 | Fix styling (#1583) | 2024-10-05 17:52:14 -07:00 |
| Byron Hsu | dde8bb16fe | default sampling param should be deepcopied (#1581) | 2024-10-05 17:27:43 -07:00 |
| Byron Hsu | 8ac3ccc060 | Backend method not found when SRT Runtime is used (#1576) | 2024-10-05 11:47:35 -07:00 |
| Jerry Zhang | 9b0926ceeb | Add llama implementation with no tensor parallel linears (#1561) | 2024-10-05 11:22:27 -07:00 |
| Byron Hsu | 6bfdb4031d | [Easy] use .text() instead of .text (#1577) | 2024-10-05 11:07:41 -07:00 |
| Liangsheng Yin | 5d0ba4038f | Refine the add request reasons to avoid corner cases. (#1574) | 2024-10-04 18:00:18 -07:00 |
| Ying Sheng | 04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563) (Co-authored-by: hnyls2002 <hnyls2002@gmail.com>) | 2024-10-04 08:51:11 +00:00 |
| Lianmin Zheng | 45473d4b2b | Make input_ids a torch.Tensor (#1568) | 2024-10-04 01:09:59 -07:00 |
| Lianmin Zheng | 114bbc8651 | Use ipc instead of tcp in zmq (#1566) | 2024-10-04 00:45:52 -07:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| HAI | e0b5dbcec1 | [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale (#1559) | 2024-10-03 01:52:26 -07:00 |
| Minsang Song | e6852b0dd2 | [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-10-02 20:41:15 -07:00 |
| Lianmin Zheng | 4ae0969c0a | Move status check in the memory pool to CPU (#1557) | 2024-10-02 18:23:35 -07:00 |
| Lianmin Zheng | 317631cada | [Fix] Move ScheduleBatch out of SamplingInfo (#1556) | 2024-10-02 17:18:04 -07:00 |
| Lianmin Zheng | b564835364 | [Fix] do not maintain regex_fsm in SamplingBatchInfo (#1555) | 2024-10-02 13:19:44 -07:00 |
| Theresa Barton | 2c7d0a5b8b | [Fix] Fix all the Huggingface paths (#1553) | 2024-10-02 10:12:07 -07:00 |
| kk | 8cdc76f6d4 | [Performance, Hardware] MoE tuning on AMD MI300x GPUs (#1554) (Co-authored-by: wunhuang <wunhuang@amd.com>) | 2024-10-02 09:52:46 -07:00 |
| Ying Sheng | f202ed9712 | [Refactor] Simplify io_struct and tokenizer_manager (#1549) | 2024-10-01 10:25:32 -07:00 |
| Liangsheng Yin | 100f5b8bc9 | Simplify flashinfer dispatch (#1552) | 2024-10-01 00:28:42 -07:00 |
| Liangsheng Yin | 619bb6ddda | Dispatch flashinfer wrappers (#1550) | 2024-09-30 23:12:36 -07:00 |
| Liangsheng Yin | b88ea90d4a | Fix bugs of logprobs_nums (#1548) | 2024-09-30 17:09:54 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Ying Sheng | 0f4fb19bc8 | [Fix, LoRA] fix LoRA with updates in main (#1545) | 2024-09-30 10:06:08 -07:00 |
| Lianmin Zheng | 63ba2f8d7b | Clean up batch data structures: Introducing ModelWorkerBatch (#1544) | 2024-09-30 06:41:49 -07:00 |
| Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Liangsheng Yin | 55b974f96f | Process image in parallel (#1539) | 2024-09-29 18:52:43 -07:00 |
| Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00 |
| Xinyu Yang | acaffd233f | [Fix] fix ipv6 url when warm up model (#1537) | 2024-09-29 11:02:40 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Liangsheng Yin | fd9ad817ec | Organize image inputs (#1531) | 2024-09-29 06:28:55 +00:00 |
| Lianmin Zheng | e165a9fc1b | Make detokenizer_manager.py not asyncio (#1532) | 2024-09-28 19:33:09 -07:00 |
| Lianmin Zheng | 4e4459b91f | Multiple minor fixes (#1530) | 2024-09-28 14:43:35 -07:00 |
| Jeffrey Fong | 065bb94753 | Fix RuntimeEndpoint.select method (#1495) | 2024-09-28 14:04:06 -07:00 |
| Kylin | f42e9bfb52 | [bugfix] Add modelscope package to avoid docker image without modelscope (#1520) | 2024-09-28 12:43:22 -07:00 |
| Ninglin Du | 840c5dbcb3 | [FIX] Catch syntax error of Regex Guide to avoid crash (#1521) | 2024-09-28 12:42:06 -07:00 |
| Jerry Zhang | 63e845d0bb | Add float8 dynamic quant to torchao_utils (#1528) | 2024-09-28 12:27:54 -07:00 |
| Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00 |
| Liangsheng Yin | 4353acb469 | minor: fix config (#1524) | 2024-09-27 01:49:16 -07:00 |
| Lianmin Zheng | 9ae1db0bdc | [Fix] Ignore import error (#1513) | 2024-09-25 11:32:21 -07:00 |
| Ying Sheng | 37c5899fc2 | Release v0.3.2 (#1512) | 2024-09-25 14:17:09 +08:00 |
| Ying Sheng | f39a0197fd | Revert "kernel: use tensor cores for flashinfer gqa kernels" (#1511) | 2024-09-24 22:50:31 -07:00 |
| TianyiQ | 3c93187caf | Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) | 2024-09-24 21:50:20 -07:00 |
| Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00 |
| Lianmin Zheng | 067d8e16fc | Simplify bench_latency.py (#1503) | 2024-09-24 17:42:07 -07:00 |
| luzengxiangcn | e6692bf4a5 | debug radixcache stack_overflow (#1499) | 2024-09-24 04:58:01 -07:00 |
| Ke Bao | 8d4ed42ad5 | MoE torch compile (#1497) | 2024-09-24 01:46:59 -07:00 |
| Lianmin Zheng | 2854a5ea9f | Fix the overhead due to penalizer in bench_latency (#1496) | 2024-09-23 07:38:14 -07:00 |
| Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00 |