Commit Graph

1255 Commits

Author SHA1 Message Date
Yineng Zhang  f5c6c66794  feat: support internlm 3 dense (#2888)  2025-01-14 19:23:26 +08:00
Ke Bao  cc0485bef2  Support w8a8 int8 quantization config (#2881)  2025-01-14 17:07:49 +08:00
Ke Bao  c19d84829c  Adjust flashinfer workspace size for Qwen2 models (#2879)  2025-01-14 13:34:22 +08:00
Lianmin Zheng  46d4431889  Add a new api configure_logging to allow dumping the requests (#2875)  2025-01-13 14:24:00 -08:00
fzyzcjy  923f518337  CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)  2025-01-13 11:38:51 -08:00
Xiaoyu Zhang  d08c77c434  Sampling penalties memory interface (#2870)  2025-01-13 23:09:00 +08:00
Lianmin Zheng  c1e097ca66  Revert "Dump requests to a folder" (#2869)  2025-01-13 06:21:25 -08:00
Lzhang-hub  6ec75e626d  add qwen2 eagle model (#2863)  2025-01-13 05:29:33 -08:00
Lianmin Zheng  336ff5b9f5  Fix typos in io_struct.py (#2867)  2025-01-13 05:13:02 -08:00
Lianmin Zheng  3b141e1509  Dump requests (#2862)  2025-01-13 04:51:56 -08:00
Lianmin Zheng  6249e4a19e  Revert "Integration of TurboMind AWQ" (#2866)  2025-01-13 04:44:39 -08:00
Ke Bao  f3516c2894  Fix quant kernel accuracy issue (#2865)  2025-01-13 20:32:17 +08:00
bjmsong  17de02f98d  Integration of TurboMind AWQ (#2828)  2025-01-13 20:14:16 +08:00
    Co-authored-by: root <bjmsong@126.com>
Lianmin Zheng  51ab3ccf47  Collect more metrics: num_requests_total (#2859)  2025-01-13 03:57:39 -08:00
Yineng Zhang  41d7e5b7e6  docs: update link (#2857)  2025-01-13 18:40:48 +08:00
kk  42f3909963  Unify sglang coding style (#2856)  2025-01-13 02:12:44 -08:00
    Co-authored-by: Lin, Soga <soga.lin@amd.com>
Lianmin Zheng  72c7776355  Fix linear.py and improve weight loading (#2851)  2025-01-13 01:39:14 -08:00
    Co-authored-by: SangBin Cho <rkooo567@gmail.com>
justdoit  4093aa4660  [Fix]eagle2 health_generate is first request,apiserver will core (#2853)  2025-01-13 01:01:21 -08:00
kk  e808c1df3e  Integrate ROCm ater package for ck moe function feasibility (#2854)  2025-01-13 08:23:07 +00:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
    Co-authored-by: Lin, Soga <soga.lin@amd.com>
bjmsong  0bb0f76311  Support FP8 E4M3 KV Cache (#2786)  2025-01-12 21:17:11 -08:00
    Co-authored-by: root <bjmsong@126.com>
Ke Bao  85b2e05770  Add int8 quant kernel (#2848)  2025-01-13 13:16:58 +08:00
Shi Shuai  c4f9707e16  Improve: Token-In Token-Out Usage for RLHF (#2843)  2025-01-11 15:14:26 -08:00
Yineng Zhang  f624901cdd  chore: bump v0.4.1.post5 (#2840)  2025-01-11 23:10:02 +08:00
Xiaoyu Zhang  f0e15dc6ab  [HotFix] fix fp8 scale load failed in tp>1 (#2837)  2025-01-11 14:34:26 +08:00
Zhiqiang Xie  5d6e9467d4  Cache controller for hierarchical caching (#2804)  2025-01-10 20:22:01 -08:00
justdoit  a47bf39123  [Eagle2] Fix multiple concurrent request crashes (#2730)  2025-01-10 14:00:43 -08:00
TianYu GUO  b170646991  Fix port number overflow (#2826)  2025-01-10 13:44:32 -08:00
Muqi Li  5413ec2bbe  [Bugfix] Fix bug in fork logic caused by null text_ (#2835)  2025-01-10 13:37:00 -08:00
Chang Su  f290bd4332  [Bugfix] Fix embedding model hangs with --enable-metrics (#2822)  2025-01-10 13:14:51 -08:00
Pratyush Patel  8f15789314  Add more metrics to serving benchmark. (#2819)  2025-01-10 23:30:44 +08:00
Lianmin Zheng  679c3bcacf  Fix typo in cuda_graph_bs (#2813)  2025-01-09 03:03:24 -08:00
Yunmeng  656aed58c6  Remove vllm dependency in model config (#2809)  2025-01-09 17:51:56 +08:00
Ke Bao  b5fb4ef58a  Update modelopt config and fix running issue (#2792)  2025-01-08 18:04:30 +08:00
Lianmin Zheng  8a6906127a  Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)  2025-01-07 23:29:10 -08:00
    Co-authored-by: SangBin Cho <rkooo567@gmail.com>
JJJJOHNSON  694e41925e  [eagle2] fix end check when target model verify (#2723)  2025-01-07 21:46:02 -08:00
Lianmin Zheng  b22f3f6475  Fix nightly accuracy tests (#2780)  2025-01-07 21:02:35 -08:00
Zhiqiang Xie  51caee740f  Host memory pool for hierarchical caching (#2771)  2025-01-07 21:38:37 +00:00
Lianmin Zheng  bdc1acf6cd  Misc fix for min_p_sampling, --cuda-graph-bs (#2761)  2025-01-07 02:52:53 -08:00
HAI  6d08ce2aa9  Use Optional with None default (#2770)  2025-01-07 01:35:08 -08:00
Lianmin Zheng  9dec582dab  Remove --modelopt-config in server_args (#2758)  2025-01-06 16:35:45 -08:00
Xingyao Wang  1acbaf1b5a  Add generator-style run_batch function (#2513)  2025-01-06 15:04:55 -08:00
    Co-authored-by: openhands <openhands@all-hands.dev>
Zhiyu  287427e2e6  Enable Nvidia's ModelOpt fp8 quantized models (#2535)  2025-01-06 14:54:52 -08:00
Lianmin Zheng  b8574f6953  Clean up eagle code (#2756)  2025-01-06 14:54:18 -08:00
Xu-Chen  2329e1ddd0  Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748)  2025-01-06 13:56:28 -08:00
    Co-authored-by: chenxu02 <chenxu02@zhihu.com>
Yineng Zhang  2f0d386496  chore: bump v0.4.1.post4 (#2713)  2025-01-06 01:29:54 +08:00
Lianmin Zheng  3a22a303d1  Revert the GLOO_SOCKET_IFNAME change (#2731)  2025-01-04 20:13:16 -08:00
libra  bdb3929dbb  Refactor SchedulePolicy to improve code organization (#2571)  2025-01-04 00:05:16 +08:00
Lianmin Zheng  0f9cc6d8d3  Fix package loss for small models (#2717)  2025-01-02 18:25:26 -08:00
    Co-authored-by: sdli1995 <mmlmonkey@163.com>
yigex  c7ae474a49  [Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601)  2025-01-02 16:23:19 -08:00
    Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
    Co-authored-by: HAI <hixiao@gmail.com>
    Co-authored-by: Bruce Xue <yigex@xilinx.com>
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng  bdf946bf81  Support loading pre-sharded moe weights (#2716)  2025-01-02 15:07:37 -08:00