Commit Graph

1237 Commits

Author SHA1 Message Date
kk
e808c1df3e Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 08:23:07 +00:00
bjmsong
0bb0f76311 Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
2025-01-12 21:17:11 -08:00
Ke Bao
85b2e05770 Add int8 quant kernel (#2848) 2025-01-13 13:16:58 +08:00
Shi Shuai
c4f9707e16 Improve: Token-In Token-Out Usage for RLHF (#2843) 2025-01-11 15:14:26 -08:00
Yineng Zhang
f624901cdd chore: bump v0.4.1.post5 (#2840) 2025-01-11 23:10:02 +08:00
Xiaoyu Zhang
f0e15dc6ab [HotFix] fix fp8 scale load failed in tp>1 (#2837) 2025-01-11 14:34:26 +08:00
Zhiqiang Xie
5d6e9467d4 Cache controller for hierarchical caching (#2804) 2025-01-10 20:22:01 -08:00
justdoit
a47bf39123 [Eagle2] Fix multiple concurrent request crashes (#2730) 2025-01-10 14:00:43 -08:00
TianYu GUO
b170646991 Fix port number overflow (#2826) 2025-01-10 13:44:32 -08:00
Muqi Li
5413ec2bbe [Bugfix] Fix bug in fork logic caused by null text_ (#2835) 2025-01-10 13:37:00 -08:00
Chang Su
f290bd4332 [Bugfix] Fix embedding model hangs with --enable-metrics (#2822) 2025-01-10 13:14:51 -08:00
Pratyush Patel
8f15789314 Add more metrics to serving benchmark. (#2819) 2025-01-10 23:30:44 +08:00
Lianmin Zheng
679c3bcacf Fix typo in cuda_graph_bs (#2813) 2025-01-09 03:03:24 -08:00
Yunmeng
656aed58c6 Remove vllm dependency in model config (#2809) 2025-01-09 17:51:56 +08:00
Ke Bao
b5fb4ef58a Update modelopt config and fix running issue (#2792) 2025-01-08 18:04:30 +08:00
Lianmin Zheng
8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-07 23:29:10 -08:00
JJJJOHNSON
694e41925e [eagle2] fix end check when target model verify (#2723) 2025-01-07 21:46:02 -08:00
Lianmin Zheng
b22f3f6475 Fix nightly accuracy tests (#2780) 2025-01-07 21:02:35 -08:00
Zhiqiang Xie
51caee740f Host memory pool for hierarchical caching (#2771) 2025-01-07 21:38:37 +00:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
HAI
6d08ce2aa9 Use Optional with None default (#2770) 2025-01-07 01:35:08 -08:00
Lianmin Zheng
9dec582dab Remove --modelopt-config in server_args (#2758) 2025-01-06 16:35:45 -08:00
Xingyao Wang
1acbaf1b5a Add generator-style run_batch function (#2513)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-06 15:04:55 -08:00
Zhiyu
287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) 2025-01-06 14:54:52 -08:00
Lianmin Zheng
b8574f6953 Clean up eagle code (#2756) 2025-01-06 14:54:18 -08:00
Xu-Chen
2329e1ddd0 Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
2025-01-06 13:56:28 -08:00
Yineng Zhang
2f0d386496 chore: bump v0.4.1.post4 (#2713) 2025-01-06 01:29:54 +08:00
Lianmin Zheng
3a22a303d1 Revert the GLOO_SOCKET_IFNAME change (#2731) 2025-01-04 20:13:16 -08:00
libra
bdb3929dbb Refactor SchedulePolicy to improve code organization (#2571) 2025-01-04 00:05:16 +08:00
Lianmin Zheng
0f9cc6d8d3 Fix package loss for small models (#2717)
Co-authored-by: sdli1995 <mmlmonkey@163.com>
2025-01-02 18:25:26 -08:00
yigex
c7ae474a49 [Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601)
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-01-02 16:23:19 -08:00
Lianmin Zheng
bdf946bf81 Support loading pre-sharded moe weights (#2716) 2025-01-02 15:07:37 -08:00
yukavio
8c8779cd05 [Fix] fix retract error in eagle speculative decoding (#2711)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 10:28:39 -08:00
Mick
1775b963db [Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714) 2025-01-02 10:28:22 -08:00
Yineng Zhang
ba5112ff69 feat: support moe_align_block_size_triton (#2712)
Co-authored-by: WANDY666 <1060304770@qq.com>
2025-01-02 21:47:44 +08:00
yukavio
815dce0554 Eagle speculative decoding part 4: Add EAGLE2 worker (#2150)
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-01-02 03:22:34 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
fzyzcjy
9183c23eca Speed up update_weights_from_tensor (#2695) 2025-01-02 02:05:19 -08:00
kk
148254d4db Improve moe reduce sum kernel performance (#2705)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-01-02 01:11:06 -08:00
kk
b6e0cfb5e1 ROCm base image update (#2692)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-01-01 12:12:19 +08:00
Xiaoyu Zhang
286cad3ee3 h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2-57B-A14B (#2689) 2024-12-31 23:17:36 +08:00
Ying Sheng
dc7eb01f19 [Fix] fix openai adapter (#2685) 2024-12-31 10:48:19 +00:00
Lianmin Zheng
b0524c3789 Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging (#2684)
Co-authored-by: yukavio <kavioyu@gmail.com>
2024-12-31 02:25:05 -08:00
Yineng Zhang
d49b13c6f8 feat: use CUDA 12.4 by default (for FA3) (#2682) 2024-12-31 15:52:09 +08:00
Lianmin Zheng
f44d143949 Support target model verification in the attention backend (#2678)
Co-authored-by: yukavio <kavioyu@gmail.com>
2024-12-30 22:58:55 -08:00
Lianmin Zheng
339c69a243 Improve the computation for time_per_output_token Prometheus metrics (#2674) 2024-12-30 21:40:14 -08:00
Lianmin Zheng
21ec66e59e Minor follow-up fixes for the logprob refactor (#2670) 2024-12-30 05:42:08 -08:00
HAI
c5210dfa38 AMD DeepSeek_V3 FP8 Numerical fix (#2667) 2024-12-30 21:31:12 +08:00
mobicham
a29dd9501d Add GemLite caching after each capture (#2669) 2024-12-30 05:27:29 -08:00
Lianmin Zheng
9c6ba2484f Refactor logprob computation to return the real logprob used in sampling (#2664) 2024-12-30 04:51:38 -08:00