yudian0504 | 531d6ea968 | fix: package data missing (#2521) | 2024-12-26 08:16:48 -08:00 |
Lianmin Zheng | dc3bee4815 | Fix test and benchmark scripts (#2598) | 2024-12-26 07:56:26 -08:00 |
fzyzcjy | 3169e66c23 | Fix duplicated handling of GetWeightsByNameReqInput (#2565) | 2024-12-26 06:49:32 -08:00 |
Lianmin Zheng | 773951548d | Fix logprob_start_len for multi modal models (#2597); Co-authored-by: libra <lihu723@gmail.com>; Co-authored-by: fzyzcjy <ch271828n@outlook.com>; Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com> | 2024-12-26 06:27:45 -08:00 |
Adarsh Shirawalmath | acb340728c | [Feature] Support new parameter - EBNF in xgrammar (#2526) | 2024-12-26 05:12:41 -08:00 |
Sangchun Ha (Patrick) | 08effbff35 | Error occurs when loading the gemma model in bitsandbytes format. (#2557) | 2024-12-26 05:10:37 -08:00 |
Liangsheng Yin | e7ebecf82e | Fix cache hit rate when chunked prefill (#2555) | 2024-12-26 03:14:28 -08:00 |
Xiaoyu Zhang | 9a23c48456 | h100 tuning fused_moe_triton for qwen2 moe (#2560) | 2024-12-26 03:13:31 -08:00 |
Yineng Zhang | 635a042623 | docs: update deepseek v3 example (#2592) | 2024-12-26 17:43:37 +08:00 |
Yineng Zhang | efc52f85e2 | chore: bump v0.4.1 (#2582) | 2024-12-26 07:14:51 +08:00 |
Yineng Zhang | 60e2fdcf4f | use sgl-kernel moe_align_block_size (#2581); Co-authored-by: ispobock <ispobaoke@163.com>; Co-authored-by: HandH1998 <1335248067@qq.com> | 2024-12-26 06:29:08 +08:00 |
HandH1998 | 53aed988cb | Refactor MoE (#2575); Co-authored-by: zhyncs <me@zhyncs.com> | 2024-12-26 00:02:14 +08:00 |
Ying Sheng | 8a56b43175 | [Bench] Flush cache before benchmarking (#2566) | 2024-12-24 11:21:21 +08:00 |
Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00 |
Lianmin Zheng | 23e5e50fd5 | Fix gemlite import (#2553) | 2024-12-22 20:21:17 -08:00 |
Lianmin Zheng | 41b1db69b8 | A better aio rwlock that guarantees the order (#2547) | 2024-12-22 15:44:32 -08:00 |
Lianmin Zheng | 8496701934 | [Misc] Fix metrics, weight update lock, request logging (#2543) | 2024-12-22 06:27:22 -08:00 |
Lei | 19ba2b0ea9 | Add lora_paths to v1_chat_generate_request (#2529) | 2024-12-22 02:23:33 -08:00 |
Yineng Zhang | 8f4d04e540 | chore: bump v0.4.0.post2 (#2525) | 2024-12-21 21:16:34 +08:00 |
Jerry Zhang | feb2b768ba | Add integration with gemlite weight only quant (#2528) | 2024-12-21 00:25:25 +08:00 |
Yineng Zhang | 4b83db24f1 | fix: continue to use flashinfer 0.1.6 temporarily (#2517) | 2024-12-19 14:03:24 +08:00 |
Yineng Zhang | 64456cf023 | docs: update README (#2516) | 2024-12-19 13:44:02 +08:00 |
Yineng Zhang | bb4a922023 | feat: add llama3 eval (#2515) | 2024-12-19 13:37:09 +08:00 |
Lianmin Zheng | 21e9e63ad5 | Print progress bar during cuda graph capture (#2502) | 2024-12-17 06:33:46 -08:00 |
Lianmin Zheng | 361ea8d912 | Fix openai protocols and pass top_k, min_p (#2499) | 2024-12-17 04:14:14 -08:00 |
Lei | 33c5ff2845 | Add lora_path to chat completion (#2438) | 2024-12-17 03:47:49 -08:00 |
Hui Liu | 5ce9daea59 | ROCm support for sglang.check_env (#2426) | 2024-12-17 03:45:14 -08:00 |
Lianmin Zheng | bd6196163e | Small fix for the order of apply_torchao_config (#2495) | 2024-12-16 19:21:11 -08:00 |
Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
Lianmin Zheng | ba36b5520a | Revert "Small fixes for torchao quant" (#2493) | 2024-12-16 15:04:16 -08:00 |
Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
Jerry Zhang | 82699474fd | Small fixes for torchao quant (#2476) | 2024-12-16 14:08:12 -08:00 |
xiaobochen | b532a5fd16 | fix moe-ep accuracy issue for fp8 (#2489) | 2024-12-16 20:54:02 +08:00 |
Yineng Zhang | 5f2595be43 | hotfix: checking for HIP (#2485) | 2024-12-15 02:47:26 +08:00 |
Ke Bao | 0ba2c58947 | Remove cuda graph batch size adjustment for dp attention (#2484) | 2024-12-14 23:53:54 +08:00 |
Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00 |
Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
SangBin Cho | 9208618b3e | [Core] in batch prefix caching by delay scheduling (#2442) | 2024-12-11 12:51:50 -08:00 |
Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
Lianmin Zheng | f8548295d6 | Fix warmup in bench_offline_throughput.py (#2449) | 2024-12-11 06:16:01 -08:00 |
Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
Yineng Zhang | 626a99ac13 | chore: update ao v0.7.0 (#2447) | 2024-12-11 04:44:28 -08:00 |
Ke Wen | ece724910a | Make torch TP composable with torchao (#2436) | 2024-12-11 04:21:42 -08:00 |
Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
Lianmin Zheng | 0ce091a82d | [Minor] Improve code style (#2419) | 2024-12-09 03:05:59 -08:00 |
Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |
Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00 |
Byron Hsu | 27f7bed7a7 | reduce watchdog interval to 5s (#2410) | 2024-12-08 21:17:31 -08:00 |
Lianmin Zheng | a6ca736c8e | Simplify stream_output (#2398) | 2024-12-08 12:27:13 -08:00 |