Commit Graph

215 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Chunyuan WU | 63051738a9 | Enable CPU device on SGLang (#2806) | 2025-01-16 21:22:53 -08:00 |
| Lianmin Zheng | 8b6ce52e92 | Support multi-node DP attention (#2925) (Co-authored-by: dhou-xai <dhou@x.ai>) | 2025-01-16 11:15:00 -08:00 |
| Rin Intachuen | a2f602b541 | fixed lm_head.weight error for quantized qwen (#2910) | 2025-01-16 06:51:43 -08:00 |
| Yineng Zhang | f5c6c66794 | feat: support internlm 3 dense (#2888) | 2025-01-14 19:23:26 +08:00 |
| Lzhang-hub | 6ec75e626d | add qwen2 eagle model (#2863) | 2025-01-13 05:29:33 -08:00 |
| bjmsong | 0bb0f76311 | Support FP8 E4M3 KV Cache (#2786) (Co-authored-by: root <bjmsong@126.com>) | 2025-01-12 21:17:11 -08:00 |
| Yunmeng | 656aed58c6 | Remove vllm dependency in model config (#2809) | 2025-01-09 17:51:56 +08:00 |
| Lianmin Zheng | 8a6906127a | Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784) (Co-authored-by: SangBin Cho <rkooo567@gmail.com>) | 2025-01-07 23:29:10 -08:00 |
| HAI | 6d08ce2aa9 | Use Optional with None default (#2770) | 2025-01-07 01:35:08 -08:00 |
| Xu-Chen | 2329e1ddd0 | Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748) (Co-authored-by: chenxu02 <chenxu02@zhihu.com>) | 2025-01-06 13:56:28 -08:00 |
| Lianmin Zheng | bdf946bf81 | Support loading pre-sharded moe weights (#2716) | 2025-01-02 15:07:37 -08:00 |
| HAI | c5210dfa38 | AMD DeepSeek_V3 FP8 Numerical fix (#2667) | 2024-12-30 21:31:12 +08:00 |
| Lianmin Zheng | 9c05c6898e | Add llama_eagle.py (#2640) (Co-authored-by: kavioyu <kavioyu@tencent.com>) | 2024-12-29 01:45:35 -08:00 |
| Lianmin Zheng | 3815b23ccb | Clean up wrapper in flashinfer backend (#2638) | 2024-12-29 00:45:57 -08:00 |
| fzyzcjy | 44f011d224 | Super tiny typo fix (#2564) | 2024-12-26 08:28:01 -08:00 |
| Sangchun Ha (Patrick) | 08effbff35 | Error occurs when loading the gemma model in bitsandbytes format. (#2557) | 2024-12-26 05:10:37 -08:00 |
| HandH1998 | 53aed988cb | Refactor MoE (#2575) (Co-authored-by: zhyncs <me@zhyncs.com>) | 2024-12-26 00:02:14 +08:00 |
| Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00 |
| Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
| Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
| Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
| Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
| Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
| Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |
| Sangchun Ha (Patrick) | 63dfab1bea | Fix shape error that occurred when loading lora weight of gemma2 model. (#2330) | 2024-12-08 01:04:08 -08:00 |
| Qun Yang | 37ee906f61 | Add more support for intel Gaudi accelerators (#2357) | 2024-12-06 01:16:33 -08:00 |
| xiaobochen | 3d32e4a32c | Resubmit MoE-EP (#2371) | 2024-12-06 15:05:21 +08:00 |
| Ke Bao | 4a63c181f1 | Fix AWQ with enable MLA (#2364) | 2024-12-06 00:46:48 +08:00 |
| Jerry Zhang | 9cc733b38c | move apply_torchao_config_ to model_runner (#2342) | 2024-12-04 17:26:42 -08:00 |
| Ke Bao | ec52464dde | MLA prefill w/o weight absorption (#2349) | 2024-12-05 01:50:28 +08:00 |
| Lianmin Zheng | 1228f7ca69 | Fix gptq for moe layers (#2300) (Co-authored-by: root <me@zhyncs.com>) | 2024-12-03 23:12:33 +08:00 |
| Ying Sheng | aa47f64223 | Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) | 2024-12-02 23:11:13 -08:00 |
| Ying Sheng | 480e38a733 | [feat] Enable chunked prefill for llava-onevision (#2281) | 2024-12-02 20:19:02 -08:00 |
| Yineng Zhang | 85e1a6f3aa | Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2024-12-02 23:22:13 +08:00 |
| Chayenne | 983bfcf386 | Online weight updates from torch.distributed (#2279) | 2024-12-01 23:23:18 -08:00 |
| Lianmin Zheng | 4936be8acc | Revert "Revert "[FEAT] Support GGUF format"" (#2287) | 2024-11-30 22:14:48 -08:00 |
| Lianmin Zheng | 7e4c6dd8da | Revert "[FEAT] Support GGUF format" (#2285) | 2024-11-30 19:03:26 -08:00 |
| Yang Zheng | 883c955489 | [FEAT] Support GGUF format (#2215) (Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>) | 2024-11-30 00:44:48 -08:00 |
| Chayenne | 7d1485d376 | Add get weights by parameter name for llama (#2266) | 2024-11-29 23:36:38 -08:00 |
| Lianmin Zheng | afe1e46586 | [Minor] fix the style for multimodal models (#2257) | 2024-11-29 04:24:20 -08:00 |
| Lianmin Zheng | f50a6cf443 | Fix hash collision for multi modal models (#2256) | 2024-11-29 03:15:58 -08:00 |
| Ying Sheng | 8b48496aaf | Revert "Revert "Add simple CPU offloading support"" (#2253) (Co-authored-by: Jani Monoses <jani.monoses@gmail.com>; Co-authored-by: youkaichao <youkaichao@gmail.com>) | 2024-11-28 23:58:54 -08:00 |
| Ying Sheng | 4057ea82c9 | Revert "Add simple CPU offloading support" (#2252) (We'll re-add the commit to correctly ack Kaichao's authorship) | 2024-11-28 23:36:55 -08:00 |
| Ying Sheng | b7038fec9b | [fix] Fix prefix caching for multi-image/video (#2239) | 2024-11-28 12:08:13 -08:00 |
| Jani Monoses | db674e3d24 | Add OLMo2 model. (#2233) | 2024-11-28 00:15:20 -08:00 |
| Lianmin Zheng | dd5eba4c88 | Remove fused_moe_grok (#2223) | 2024-11-27 14:28:55 -08:00 |
| Ying Sheng | 37c8a5761f | [feat] Support session control for vision language models (#2210) | 2024-11-27 00:03:29 -08:00 |
| Lianmin Zheng | be0124bda0 | Rename triton_fused_moe -> fused_moe_triton (#2163) | 2024-11-24 08:12:35 -08:00 |
| Yineng Zhang | e3938b2f9c | feat: update other MoE models deps (#2156) | 2024-11-24 21:36:34 +08:00 |
| Yineng Zhang | b509db5832 | feat: remove the dependency on FusedMoE (#2153) | 2024-11-24 20:09:27 +08:00 |