| Author | Commit | Message | Date |
|---|---|---|---|
| Chayenne | bf3edc2c60 | Docs: Update pull_request_template.md (#2928) | 2025-01-16 13:04:11 -08:00 |
| Xiaoyu Zhang | 78e974b2a5 | [kernel] MiniMax-Text-01 decode lightning_attn with triton (#2920) | 2025-01-16 12:51:38 -08:00 |
| Lianmin Zheng | bc6915e3b9 | Improve type annotation and styles (#2926) | 2025-01-16 12:51:11 -08:00 |
| saienduri | a883f0790d | Update release-docker-amd.yml to run on amd docker runner. (#2927) | 2025-01-16 12:42:29 -08:00 |
| Lianmin Zheng | 8b6ce52e92 | Support multi-node DP attention (#2925); Co-authored-by: dhou-xai <dhou@x.ai> | 2025-01-16 11:15:00 -08:00 |
| Ke Bao | 58f3f2b840 | Add CI for sgl-kernel (#2924) | 2025-01-17 01:26:51 +08:00 |
| Lianmin Zheng | 93d690617e | Simplify the process launch code in server.py (#2923) | 2025-01-16 07:52:17 -08:00 |
| Yun Dai | e00e5385e0 | add profiling to bench_one_batch script (#2821) | 2025-01-16 07:24:24 -08:00 |
| Rin Intachuen | a2f602b541 | fixed lm_head.weight error for quantized qwen (#2910) | 2025-01-16 06:51:43 -08:00 |
| Lianmin Zheng | 8f2c522aba | Improve benchmark scripts and error message printing (#2922) | 2025-01-16 06:24:31 -08:00 |
| Yineng Zhang | 7596417732 | minor: use bear for compilation database (#2919) | 2025-01-16 18:39:11 +08:00 |
| Yineng Zhang | 2dc957d421 | fix setup for sgl kernel (#2917) | 2025-01-16 18:17:34 +08:00 |
| Yineng Zhang | bf8d07a6f9 | feat: patch linear base (#2915) | 2025-01-16 18:00:03 +08:00 |
| Xiaoyu Zhang | ab31793661 | [kernel] MiniMax-Text-01 prefill lightning_attn with triton (#2911) | 2025-01-16 14:18:29 +08:00 |
| Yineng Zhang | b7f3fec13c | minor: rename bench for sgl kernel (#2909) | 2025-01-16 05:55:43 +08:00 |
| Yineng Zhang | 58f42b1dd8 | minor: update pr test (#2908) | 2025-01-16 05:51:49 +08:00 |
| yizhang2077 | 767c9dec03 | adapt custom allreduce for tensorrt llm (#2511) | 2025-01-16 04:57:35 +08:00 |
| Yineng Zhang | a53454c55e | fix: sgl-kernel link cuda (#2906) | 2025-01-16 04:53:23 +08:00 |
| yizhang2077 | 6cb3974e77 | optimize custom allreduce kernel (#2904) | 2025-01-16 03:04:25 +08:00 |
| Lianmin Zheng | f65c13b559 | Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) | 2025-01-15 04:54:14 -08:00 |
| Cody Yu | b803b395b7 | Disable graceful shutdown of tokenizer manager when not in the main thread (#2872) | 2025-01-15 03:29:33 -08:00 |
| Ke Bao | bfbda62c8b | Add ut for w8a8 int8 quantization (#2897) | 2025-01-15 18:29:14 +08:00 |
| Yineng Zhang | b3e99dfb22 | chore: bump v0.4.1.post6 (#2899) | 2025-01-15 16:23:42 +08:00 |
| Xiaoyu Zhang | f005758f2b | introduce CUB in sgl-kernel (#2887) | 2025-01-14 19:48:59 +08:00 |
| Yineng Zhang | f5c6c66794 | feat: support internlm 3 dense (#2888) | 2025-01-14 19:23:26 +08:00 |
| Ke Bao | cc0485bef2 | Support w8a8 int8 quantization config (#2881) | 2025-01-14 17:07:49 +08:00 |
| kk | b8cd09f27a | update ROCm docker for layernorm kernel optimization (#2885); Co-authored-by: wunhuang <wunhuang@amd.com> | 2025-01-14 16:59:43 +08:00 |
| Ke Bao | c19d84829c | Adjust flashinfer workspace size for Qwen2 models (#2879) | 2025-01-14 13:34:22 +08:00 |
| Yineng Zhang | 80002562a8 | docs: update README (#2878) | 2025-01-14 12:48:17 +08:00 |
| Lianmin Zheng | 46d4431889 | Add a new api configure_logging to allow dumping the requests (#2875) | 2025-01-13 14:24:00 -08:00 |
| fzyzcjy | 923f518337 | CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) | 2025-01-13 11:38:51 -08:00 |
| Xiaoyu Zhang | d08c77c434 | Sampling penalties memory interface (#2870) | 2025-01-13 23:09:00 +08:00 |
| Lianmin Zheng | c1e097ca66 | Revert "Dump requests to a folder" (#2869) | 2025-01-13 06:21:25 -08:00 |
| Lzhang-hub | 6ec75e626d | add qwen2 eagle model (#2863) | 2025-01-13 05:29:33 -08:00 |
| Yineng Zhang | d855653bd4 | minor: fix release docs (#2868) | 2025-01-13 21:18:39 +08:00 |
| Lianmin Zheng | 336ff5b9f5 | Fix typos in io_struct.py (#2867) | 2025-01-13 05:13:02 -08:00 |
| Lianmin Zheng | 3b141e1509 | Dump requests (#2862) | 2025-01-13 04:51:56 -08:00 |
| Lianmin Zheng | 6249e4a19e | Revert "Integration of TurboMind AWQ" (#2866) | 2025-01-13 04:44:39 -08:00 |
| Ke Bao | f3516c2894 | Fix quant kernel accuracy issue (#2865) | 2025-01-13 20:32:17 +08:00 |
| bjmsong | 17de02f98d | Integration of TurboMind AWQ (#2828); Co-authored-by: root <bjmsong@126.com> | 2025-01-13 20:14:16 +08:00 |
| Lianmin Zheng | 51ab3ccf47 | Collect more metrics: num_requests_total (#2859) | 2025-01-13 03:57:39 -08:00 |
| Lianmin Zheng | 67008f4b32 | Use only one GPU for MLA CI tests (#2858) | 2025-01-13 03:55:33 -08:00 |
| Yineng Zhang | 4536d72446 | minor: use ubuntu-latest instead of self-hosted runner for amd build (#2861) | 2025-01-13 18:58:56 +08:00 |
| Yineng Zhang | 41d7e5b7e6 | docs: update link (#2857) | 2025-01-13 18:40:48 +08:00 |
| Yineng Zhang | 20a9f5dfe0 | fix: not delete CNAME (#2860) | 2025-01-13 18:36:40 +08:00 |
| kk | 42f3909963 | Unify sglang coding style (#2856); Co-authored-by: Lin, Soga <soga.lin@amd.com> | 2025-01-13 02:12:44 -08:00 |
| Lianmin Zheng | 72c7776355 | Fix linear.py and improve weight loading (#2851); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-01-13 01:39:14 -08:00 |
| justdoit | 4093aa4660 | [Fix]eagle2 health_generate is first request,apiserver will core (#2853) | 2025-01-13 01:01:21 -08:00 |
| kk | e808c1df3e | Integrate ROCm ater package for ck moe function feasibility (#2854); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: Lin, Soga <soga.lin@amd.com> | 2025-01-13 08:23:07 +00:00 |
| sogalin | a18ab81ddd | Update base image for ROCm (#2852); Co-authored-by: HAI <hixiao@gmail.com> | 2025-01-13 14:39:44 +08:00 |