kk
|
e808c1df3e
|
Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
|
2025-01-13 08:23:07 +00:00 |
|
bjmsong
|
0bb0f76311
|
Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
|
2025-01-12 21:17:11 -08:00 |
|
Ke Bao
|
85b2e05770
|
Add int8 quant kernel (#2848)
|
2025-01-13 13:16:58 +08:00 |
|
Shi Shuai
|
c4f9707e16
|
Improve: Token-In Token-Out Usage for RLHF (#2843)
|
2025-01-11 15:14:26 -08:00 |
|
Yineng Zhang
|
f624901cdd
|
chore: bump v0.4.1.post5 (#2840)
|
2025-01-11 23:10:02 +08:00 |
|
Xiaoyu Zhang
|
f0e15dc6ab
|
[HotFix] fix fp8 scale load failed in tp>1 (#2837)
|
2025-01-11 14:34:26 +08:00 |
|
Zhiqiang Xie
|
5d6e9467d4
|
Cache controller for hierarchical caching (#2804)
|
2025-01-10 20:22:01 -08:00 |
|
justdoit
|
a47bf39123
|
[Eagle2] Fix multiple concurrent request crashes (#2730)
|
2025-01-10 14:00:43 -08:00 |
|
TianYu GUO
|
b170646991
|
Fix port number overflow (#2826)
|
2025-01-10 13:44:32 -08:00 |
|
Muqi Li
|
5413ec2bbe
|
[Bugfix] Fix bug in fork logic caused by null text_ (#2835)
|
2025-01-10 13:37:00 -08:00 |
|
Chang Su
|
f290bd4332
|
[Bugfix] Fix embedding model hangs with --enable-metrics (#2822)
|
2025-01-10 13:14:51 -08:00 |
|
Pratyush Patel
|
8f15789314
|
Add more metrics to serving benchmark. (#2819)
|
2025-01-10 23:30:44 +08:00 |
|
Lianmin Zheng
|
679c3bcacf
|
Fix typo in cuda_graph_bs (#2813)
|
2025-01-09 03:03:24 -08:00 |
|
Yunmeng
|
656aed58c6
|
Remove vllm dependency in model config (#2809)
|
2025-01-09 17:51:56 +08:00 |
|
Ke Bao
|
b5fb4ef58a
|
Update modelopt config and fix running issue (#2792)
|
2025-01-08 18:04:30 +08:00 |
|
Lianmin Zheng
|
8a6906127a
|
Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-07 23:29:10 -08:00 |
|
JJJJOHNSON
|
694e41925e
|
[eagle2] fix end check when target model verify (#2723)
|
2025-01-07 21:46:02 -08:00 |
|
Lianmin Zheng
|
b22f3f6475
|
Fix nightly accuracy tests (#2780)
|
2025-01-07 21:02:35 -08:00 |
|
Zhiqiang Xie
|
51caee740f
|
Host memory pool for hierarchical caching (#2771)
|
2025-01-07 21:38:37 +00:00 |
|
Lianmin Zheng
|
bdc1acf6cd
|
Misc fix for min_p_sampling, --cuda-graph-bs (#2761)
|
2025-01-07 02:52:53 -08:00 |
|
HAI
|
6d08ce2aa9
|
Use Optional with None default (#2770)
|
2025-01-07 01:35:08 -08:00 |
|
Lianmin Zheng
|
9dec582dab
|
Remove --modelopt-config in server_args (#2758)
|
2025-01-06 16:35:45 -08:00 |
|
Xingyao Wang
|
1acbaf1b5a
|
Add generator-style run_batch function (#2513)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-06 15:04:55 -08:00 |
|
Zhiyu
|
287427e2e6
|
Enable Nvidia's ModelOpt fp8 quantized models (#2535)
|
2025-01-06 14:54:52 -08:00 |
|
Lianmin Zheng
|
b8574f6953
|
Clean up eagle code (#2756)
|
2025-01-06 14:54:18 -08:00 |
|
Xu-Chen
|
2329e1ddd0
|
Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
|
2025-01-06 13:56:28 -08:00 |
|
Yineng Zhang
|
2f0d386496
|
chore: bump v0.4.1.post4 (#2713)
|
2025-01-06 01:29:54 +08:00 |
|
Lianmin Zheng
|
3a22a303d1
|
Revert the GLOO_SOCKET_IFNAME change (#2731)
|
2025-01-04 20:13:16 -08:00 |
|
libra
|
bdb3929dbb
|
Refactor SchedulePolicy to improve code organization (#2571)
|
2025-01-04 00:05:16 +08:00 |
|
Lianmin Zheng
|
0f9cc6d8d3
|
Fix package loss for small models (#2717)
Co-authored-by: sdli1995 <mmlmonkey@163.com>
|
2025-01-02 18:25:26 -08:00 |
|
yigex
|
c7ae474a49
|
[Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601)
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-01-02 16:23:19 -08:00 |
|
Lianmin Zheng
|
bdf946bf81
|
Support loading pre-sharded moe weights (#2716)
|
2025-01-02 15:07:37 -08:00 |
|
yukavio
|
8c8779cd05
|
[Fix] fix retract error in eagle speculative decoding (#2711)
Co-authored-by: kavioyu <kavioyu@tencent.com>
|
2025-01-02 10:28:39 -08:00 |
|
Mick
|
1775b963db
|
[Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714)
|
2025-01-02 10:28:22 -08:00 |
|
Yineng Zhang
|
ba5112ff69
|
feat: support moe_align_block_size_triton (#2712)
Co-authored-by: WANDY666 <1060304770@qq.com>
|
2025-01-02 21:47:44 +08:00 |
|
yukavio
|
815dce0554
|
Eagle speculative decoding part 4: Add EAGLE2 worker (#2150)
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2025-01-02 03:22:34 -08:00 |
|
Lianmin Zheng
|
ad20b7957e
|
Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
|
2025-01-02 02:09:08 -08:00 |
|
fzyzcjy
|
9183c23eca
|
Speed up update_weights_from_tensor (#2695)
|
2025-01-02 02:05:19 -08:00 |
|
kk
|
148254d4db
|
Improve moe reduce sum kernel performance (#2705)
Co-authored-by: wunhuang <wunhuang@amd.com>
|
2025-01-02 01:11:06 -08:00 |
|
kk
|
b6e0cfb5e1
|
ROCm base image update (#2692)
Co-authored-by: wunhuang <wunhuang@amd.com>
|
2025-01-01 12:12:19 +08:00 |
|
Xiaoyu Zhang
|
286cad3ee3
|
h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B (#2689)
|
2024-12-31 23:17:36 +08:00 |
|
Ying Sheng
|
dc7eb01f19
|
[Fix] fix openai adapter (#2685)
|
2024-12-31 10:48:19 +00:00 |
|
Lianmin Zheng
|
b0524c3789
|
Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging (#2684)
Co-authored-by: yukavio <kavioyu@gmail.com>
|
2024-12-31 02:25:05 -08:00 |
|
Yineng Zhang
|
d49b13c6f8
|
feat: use CUDA 12.4 by default (for FA3) (#2682)
|
2024-12-31 15:52:09 +08:00 |
|
Lianmin Zheng
|
f44d143949
|
Support target model verification in the attention backend (#2678)
Co-authored-by: yukavio <kavioyu@gmail.com>
|
2024-12-30 22:58:55 -08:00 |
|
Lianmin Zheng
|
339c69a243
|
Improve the computation for time_per_output_token Prometheus metrics (#2674)
|
2024-12-30 21:40:14 -08:00 |
|
Lianmin Zheng
|
21ec66e59e
|
Minor follow-up fixes for the logprob refactor (#2670)
|
2024-12-30 05:42:08 -08:00 |
|
HAI
|
c5210dfa38
|
AMD DeepSeek_V3 FP8 Numerical fix (#2667)
|
2024-12-30 21:31:12 +08:00 |
|
mobicham
|
a29dd9501d
|
Add GemLite caching after each capture (#2669)
|
2024-12-30 05:27:29 -08:00 |
|
Lianmin Zheng
|
9c6ba2484f
|
Refactor logprob computation to return the real logprob used in sampling (#2664)
|
2024-12-30 04:51:38 -08:00 |
|