Commit Graph

1165 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| HAI | 7722c11c1d | Regression fix to AMD/ROCm from recent change (#2606) | 2024-12-26 20:22:14 -08:00 |
| fzyzcjy | b2ed5c8ea7 | Tiny code cleanup in tokenizer_manager.py (#2586) | 2024-12-26 17:53:09 -08:00 |
| fzyzcjy | 44f011d224 | Super tiny typo fix (#2564) | 2024-12-26 08:28:01 -08:00 |
| yudian0504 | 531d6ea968 | fix: package data missing (#2521) | 2024-12-26 08:16:48 -08:00 |
| Lianmin Zheng | dc3bee4815 | Fix test and benchmark scripts (#2598) | 2024-12-26 07:56:26 -08:00 |
| fzyzcjy | 3169e66c23 | Fix duplicated handling of GetWeightsByNameReqInput (#2565) | 2024-12-26 06:49:32 -08:00 |
| Lianmin Zheng | 773951548d | Fix logprob_start_len for multi modal models (#2597) (Co-authored-by: libra <lihu723@gmail.com>, fzyzcjy <ch271828n@outlook.com>, Wang, Haoyu <haoyu.wang@intel.com>) | 2024-12-26 06:27:45 -08:00 |
| Adarsh Shirawalmath | acb340728c | [Feature] Support new parameter - EBNF in xgrammar (#2526) | 2024-12-26 05:12:41 -08:00 |
| Sangchun Ha (Patrick) | 08effbff35 | Error occurs when loading the gemma model in bitsandbytes format. (#2557) | 2024-12-26 05:10:37 -08:00 |
| Liangsheng Yin | e7ebecf82e | Fix cache hit rate when chunked prefill (#2555) | 2024-12-26 03:14:28 -08:00 |
| Xiaoyu Zhang | 9a23c48456 | h100 tuning fused_moe_triton for qwen2 moe (#2560) | 2024-12-26 03:13:31 -08:00 |
| Yineng Zhang | 635a042623 | docs: update deepseek v3 example (#2592) | 2024-12-26 17:43:37 +08:00 |
| Yineng Zhang | efc52f85e2 | chore: bump v0.4.1 (#2582) | 2024-12-26 07:14:51 +08:00 |
| Yineng Zhang | 60e2fdcf4f | use sgl-kernel moe_align_block_size (#2581) (Co-authored-by: ispobock <ispobaoke@163.com>, HandH1998 <1335248067@qq.com>) | 2024-12-26 06:29:08 +08:00 |
| HandH1998 | 53aed988cb | Refactor MoE (#2575) (Co-authored-by: zhyncs <me@zhyncs.com>) | 2024-12-26 00:02:14 +08:00 |
| Ying Sheng | 8a56b43175 | [Bench] Flush cache before benchmarking (#2566) | 2024-12-24 11:21:21 +08:00 |
| Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00 |
| Lianmin Zheng | 23e5e50fd5 | Fix gemlite import (#2553) | 2024-12-22 20:21:17 -08:00 |
| Lianmin Zheng | 41b1db69b8 | A better aio rwlock that guarantees the order (#2547) | 2024-12-22 15:44:32 -08:00 |
| Lianmin Zheng | 8496701934 | [Misc] Fix metrics, weight update lock, request logging (#2543) | 2024-12-22 06:27:22 -08:00 |
| Lei | 19ba2b0ea9 | Add lora_paths to v1_chat_generate_request (#2529) | 2024-12-22 02:23:33 -08:00 |
| Yineng Zhang | 8f4d04e540 | chore: bump v0.4.0.post2 (#2525) | 2024-12-21 21:16:34 +08:00 |
| Jerry Zhang | feb2b768ba | Add integration with gemlite weight only quant (#2528) | 2024-12-21 00:25:25 +08:00 |
| Yineng Zhang | 4b83db24f1 | fix: continue to use flashinfer 0.1.6 temporarily (#2517) | 2024-12-19 14:03:24 +08:00 |
| Yineng Zhang | 64456cf023 | docs: update README (#2516) | 2024-12-19 13:44:02 +08:00 |
| Yineng Zhang | bb4a922023 | feat: add llama3 eval (#2515) | 2024-12-19 13:37:09 +08:00 |
| Lianmin Zheng | 21e9e63ad5 | Print progress bar during cuda graph capture (#2502) | 2024-12-17 06:33:46 -08:00 |
| Lianmin Zheng | 361ea8d912 | Fix openai protocols and pass top_k, min_p (#2499) | 2024-12-17 04:14:14 -08:00 |
| Lei | 33c5ff2845 | Add lora_path to chat completion (#2438) | 2024-12-17 03:47:49 -08:00 |
| Hui Liu | 5ce9daea59 | ROCm support for sglang.check_env (#2426) | 2024-12-17 03:45:14 -08:00 |
| Lianmin Zheng | bd6196163e | Small fix for the order of apply_torchao_config (#2495) | 2024-12-16 19:21:11 -08:00 |
| Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
| Lianmin Zheng | ba36b5520a | Revert "Small fixes for torchao quant" (#2493) | 2024-12-16 15:04:16 -08:00 |
| Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
| Jerry Zhang | 82699474fd | Small fixes for torchao quant (#2476) | 2024-12-16 14:08:12 -08:00 |
| xiaobochen | b532a5fd16 | fix moe-ep accuracy issue for fp8 (#2489) | 2024-12-16 20:54:02 +08:00 |
| Yineng Zhang | 5f2595be43 | hotfix: checking for HIP (#2485) | 2024-12-15 02:47:26 +08:00 |
| Ke Bao | 0ba2c58947 | Remove cuda graph batch size adjustment for dp attention (#2484) | 2024-12-14 23:53:54 +08:00 |
| Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00 |
| Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
| SangBin Cho | 9208618b3e | [Core] in batch prefix caching by delay scheduling (#2442) | 2024-12-11 12:51:50 -08:00 |
| Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
| Lianmin Zheng | f8548295d6 | Fix warmup in bench_offline_throughput.py (#2449) | 2024-12-11 06:16:01 -08:00 |
| Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
| Yineng Zhang | 626a99ac13 | chore: update ao v0.7.0 (#2447) | 2024-12-11 04:44:28 -08:00 |
| Ke Wen | ece724910a | Make torch TP composable with torchao (#2436) | 2024-12-11 04:21:42 -08:00 |
| Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
| Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
| Lianmin Zheng | 0ce091a82d | [Minor] Improve code style (#2419) | 2024-12-09 03:05:59 -08:00 |
| Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |