Lianmin Zheng
|
46d4431889
|
Add a new api configure_logging to allow dumping the requests (#2875)
|
2025-01-13 14:24:00 -08:00 |
|
fzyzcjy
|
923f518337
|
CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)
|
2025-01-13 11:38:51 -08:00 |
|
Xiaoyu Zhang
|
d08c77c434
|
Sampling penalties memory interface (#2870)
|
2025-01-13 23:09:00 +08:00 |
|
Lianmin Zheng
|
c1e097ca66
|
Revert "Dump requests to a folder" (#2869)
|
2025-01-13 06:21:25 -08:00 |
|
Lzhang-hub
|
6ec75e626d
|
add qwen2 eagle model (#2863)
|
2025-01-13 05:29:33 -08:00 |
|
Yineng Zhang
|
d855653bd4
|
minor: fix release docs (#2868)
|
2025-01-13 21:18:39 +08:00 |
|
Lianmin Zheng
|
336ff5b9f5
|
Fix typos in io_struct.py (#2867)
|
2025-01-13 05:13:02 -08:00 |
|
Lianmin Zheng
|
3b141e1509
|
Dump requests (#2862)
|
2025-01-13 04:51:56 -08:00 |
|
Lianmin Zheng
|
6249e4a19e
|
Revert "Integration of TurboMind AWQ" (#2866)
|
2025-01-13 04:44:39 -08:00 |
|
Ke Bao
|
f3516c2894
|
Fix quant kernel accuracy issue (#2865)
|
2025-01-13 20:32:17 +08:00 |
|
bjmsong
|
17de02f98d
|
Integration of TurboMind AWQ (#2828)
Co-authored-by: root <bjmsong@126.com>
|
2025-01-13 20:14:16 +08:00 |
|
Lianmin Zheng
|
51ab3ccf47
|
Collect more metrics: num_requests_total (#2859)
|
2025-01-13 03:57:39 -08:00 |
|
Lianmin Zheng
|
67008f4b32
|
Use only one GPU for MLA CI tests (#2858)
|
2025-01-13 03:55:33 -08:00 |
|
Yineng Zhang
|
4536d72446
|
minor: use ubuntu-latest instead of self-hosted runner for amd build (#2861)
|
2025-01-13 18:58:56 +08:00 |
|
Yineng Zhang
|
41d7e5b7e6
|
docs: update link (#2857)
|
2025-01-13 18:40:48 +08:00 |
|
Yineng Zhang
|
20a9f5dfe0
|
fix: not delete CNAME (#2860)
|
2025-01-13 18:36:40 +08:00 |
|
kk
|
42f3909963
|
Unify sglang coding style (#2856)
Co-authored-by: Lin, Soga <soga.lin@amd.com>
|
2025-01-13 02:12:44 -08:00 |
|
Lianmin Zheng
|
72c7776355
|
Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-13 01:39:14 -08:00 |
|
justdoit
|
4093aa4660
|
[Fix]eagle2 health_generate is first request,apiserver will core (#2853)
|
2025-01-13 01:01:21 -08:00 |
|
kk
|
e808c1df3e
|
Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
|
2025-01-13 08:23:07 +00:00 |
|
sogalin
|
a18ab81ddd
|
Update base image for ROCm (#2852)
Co-authored-by: HAI <hixiao@gmail.com>
|
2025-01-13 14:39:44 +08:00 |
|
bjmsong
|
0bb0f76311
|
Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
|
2025-01-12 21:17:11 -08:00 |
|
Ke Bao
|
85b2e05770
|
Add int8 quant kernel (#2848)
|
2025-01-13 13:16:58 +08:00 |
|
Yineng Zhang
|
a879c2fb4c
|
fix sgl-kernel build (#2850)
|
2025-01-13 12:27:17 +08:00 |
|
Xiaoyu Zhang
|
e2b16c4716
|
add sampling_scaling_penalties kernel (#2846)
|
2025-01-12 19:38:17 -08:00 |
|
Shi Shuai
|
c4f9707e16
|
Improve: Token-In Token-Out Usage for RLHF (#2843)
|
2025-01-11 15:14:26 -08:00 |
|
Yineng Zhang
|
197cbf9bab
|
docs: update README (#2841)
|
2025-01-11 23:11:38 +08:00 |
|
Yineng Zhang
|
f624901cdd
|
chore: bump v0.4.1.post5 (#2840)
|
2025-01-11 23:10:02 +08:00 |
|
Xiaoyu Zhang
|
f0e15dc6ab
|
[HotFix] fix fp8 scale load failed in tp>1 (#2837)
|
2025-01-11 14:34:26 +08:00 |
|
Lianmin Zheng
|
f1769586d6
|
Update threshold in test_nightly_gsm8k_eval.py (#2836)
|
2025-01-10 20:37:34 -08:00 |
|
Zhiqiang Xie
|
5d6e9467d4
|
Cache controller for hierarchical caching (#2804)
|
2025-01-10 20:22:01 -08:00 |
|
justdoit
|
a47bf39123
|
[Eagle2] Fix multiple concurrent request crashes (#2730)
|
2025-01-10 14:00:43 -08:00 |
|
TianYu GUO
|
b170646991
|
Fix port number overflow (#2826)
|
2025-01-10 13:44:32 -08:00 |
|
Muqi Li
|
5413ec2bbe
|
[Bugfix] Fix bug in fork logic caused by null text_ (#2835)
|
2025-01-10 13:37:00 -08:00 |
|
Chang Su
|
f290bd4332
|
[Bugfix] Fix embedding model hangs with --enable-metrics (#2822)
|
2025-01-10 13:14:51 -08:00 |
|
Pratyush Patel
|
8f15789314
|
Add more metrics to serving benchmark. (#2819)
|
2025-01-10 23:30:44 +08:00 |
|
Lianmin Zheng
|
2db03a04ca
|
Update README.md (#2833)
Co-authored-by: Heiner <heiner@x.ai>
|
2025-01-10 03:49:04 -08:00 |
|
Chayenne
|
5cc1170552
|
Doc: add block-wise FP8 in dpsk model reference (#2830)
|
2025-01-10 00:26:59 -08:00 |
|
Xiaotong Jiang
|
11fffbc95a
|
[Doc]: Deepseek reference docs (#2787)
|
2025-01-09 13:43:12 -08:00 |
|
sleepcoo
|
4f077c01b8
|
minor: support specifying local dataset path for gsm8k and hellaswag (#2816)
|
2025-01-09 22:24:42 +08:00 |
|
Lianmin Zheng
|
679c3bcacf
|
Fix typo in cuda_graph_bs (#2813)
|
2025-01-09 03:03:24 -08:00 |
|
Yunmeng
|
656aed58c6
|
Remove vllm dependency in model config (#2809)
|
2025-01-09 17:51:56 +08:00 |
|
Ke Bao
|
b5fb4ef58a
|
Update modelopt config and fix running issue (#2792)
|
2025-01-08 18:04:30 +08:00 |
|
Chayenne
|
2e6346fc2e
|
Docs:Update the style of llma 3.1 405B docs (#2789)
|
2025-01-08 01:07:54 -08:00 |
|
mlmz
|
977f785dad
|
Docs: Rewrite docs for LLama 405B and ModelSpace (#2773)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
|
2025-01-08 00:02:59 -08:00 |
|
Lianmin Zheng
|
8a6906127a
|
Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-07 23:29:10 -08:00 |
|
JJJJOHNSON
|
694e41925e
|
[eagle2] fix end check when target model verify (#2723)
|
2025-01-07 21:46:02 -08:00 |
|
Lianmin Zheng
|
b22f3f6475
|
Fix nightly accuracy tests (#2780)
|
2025-01-07 21:02:35 -08:00 |
|
Lianmin Zheng
|
6fb5768372
|
Disable math eval on nightly CI temporarily (#2779)
|
2025-01-07 18:17:34 -08:00 |
|
Zhiqiang Xie
|
51caee740f
|
Host memory pool for hierarchical caching (#2771)
|
2025-01-07 21:38:37 +00:00 |
|