Commit Graph

1277 Commits

Author SHA1 Message Date
Ke Bao
656dcc1a99 Remove fp8 monkey patch (#2960) 2025-01-18 15:00:29 +08:00
Zhiqiang Xie
8af7048dcf Query remaining memory dynamically for PrefillAdder (#2941) 2025-01-17 20:20:26 -08:00
bjmsong
d3024f4fc8 support e4m3 kvcache in qwen2 & add kv scaling facotr json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-18 11:43:22 +08:00
Yineng Zhang
78e5b22f29 feat: use get_rope for gemma2 (#2954) 2025-01-18 02:57:18 +08:00
Yineng Zhang
7a15e9ad36 cleanup models unused import 2/n (#2952) 2025-01-18 01:09:19 +08:00
Yineng Zhang
033c715b46 cleanup models dependencies 1/n (#2948) 2025-01-17 23:46:48 +08:00
Ke Bao
53e6552fed Fix qwen accuracy issue (#2945) 2025-01-17 22:35:26 +08:00
Yineng Zhang
5dc54f1a62 feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
2025-01-17 22:31:51 +08:00
Chunyuan WU
63051738a9 Enable CPU device on SGLang (#2806) 2025-01-16 21:22:53 -08:00
Chang Su
a8ccacc8b8 [Frontend] Fix request length check and add option to disallow auto truncation in scheduler (#2876) 2025-01-16 14:51:19 -08:00
Lianmin Zheng
0427416b59 Fix zmq binding (#2930)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
2025-01-16 14:36:07 -08:00
Lianmin Zheng
bc6915e3b9 Improve type annotation and styles (#2926) 2025-01-16 12:51:11 -08:00
Lianmin Zheng
8b6ce52e92 Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
2025-01-16 11:15:00 -08:00
Lianmin Zheng
93d690617e Simplify the process launch code in server.py (#2923) 2025-01-16 07:52:17 -08:00
Yun Dai
e00e5385e0 add profiling to bench_one_batch script (#2821) 2025-01-16 07:24:24 -08:00
Rin Intachuen
a2f602b541 fixed lm_head.weight error for quantized qwen (#2910) 2025-01-16 06:51:43 -08:00
Lianmin Zheng
8f2c522aba Improve benchmark scripts and error message printing (#2922) 2025-01-16 06:24:31 -08:00
Yineng Zhang
bf8d07a6f9 feat: patch linear base (#2915) 2025-01-16 18:00:03 +08:00
yizhang2077
767c9dec03 adapt custom allreduce for tensorrt llm (#2511) 2025-01-16 04:57:35 +08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
Cody Yu
b803b395b7 Disable graceful shutdown of tokenizer manager when not in the main thread (#2872) 2025-01-15 03:29:33 -08:00
Yineng Zhang
b3e99dfb22 chore: bump v0.4.1.post6 (#2899) 2025-01-15 16:23:42 +08:00
Yineng Zhang
f5c6c66794 feat: support internlm 3 dense (#2888) 2025-01-14 19:23:26 +08:00
Ke Bao
cc0485bef2 Support w8a8 int8 quantization config (#2881) 2025-01-14 17:07:49 +08:00
Ke Bao
c19d84829c Adjust flashinfer workspace size for Qwen2 models (#2879) 2025-01-14 13:34:22 +08:00
Lianmin Zheng
46d4431889 Add a new api configure_logging to allow dumping the requests (#2875) 2025-01-13 14:24:00 -08:00
fzyzcjy
923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) 2025-01-13 11:38:51 -08:00
Xiaoyu Zhang
d08c77c434 Sampling penalties memory interface (#2870) 2025-01-13 23:09:00 +08:00
Lianmin Zheng
c1e097ca66 Revert "Dump requests to a folder" (#2869) 2025-01-13 06:21:25 -08:00
Lzhang-hub
6ec75e626d add qwen2 eagle model (#2863) 2025-01-13 05:29:33 -08:00
Lianmin Zheng
336ff5b9f5 Fix typos in io_struct.py (#2867) 2025-01-13 05:13:02 -08:00
Lianmin Zheng
3b141e1509 Dump requests (#2862) 2025-01-13 04:51:56 -08:00
Lianmin Zheng
6249e4a19e Revert "Integration of TurboMind AWQ" (#2866) 2025-01-13 04:44:39 -08:00
Ke Bao
f3516c2894 Fix quant kernel accuracy issue (#2865) 2025-01-13 20:32:17 +08:00
bjmsong
17de02f98d Integration of TurboMind AWQ (#2828)
Co-authored-by: root <bjmsong@126.com>
2025-01-13 20:14:16 +08:00
Lianmin Zheng
51ab3ccf47 Collect more metrics: num_requests_total (#2859) 2025-01-13 03:57:39 -08:00
Yineng Zhang
41d7e5b7e6 docs: update link (#2857) 2025-01-13 18:40:48 +08:00
kk
42f3909963 Unify sglang coding style (#2856)
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 02:12:44 -08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
justdoit
4093aa4660 [Fix]eagle2 health_generate is first request,apiserver will core (#2853) 2025-01-13 01:01:21 -08:00
kk
e808c1df3e Integrate ROCm ater package for ck moe function feasibility (#2854)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
2025-01-13 08:23:07 +00:00
bjmsong
0bb0f76311 Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
2025-01-12 21:17:11 -08:00
Ke Bao
85b2e05770 Add int8 quant kernel (#2848) 2025-01-13 13:16:58 +08:00
Shi Shuai
c4f9707e16 Improve: Token-In Token-Out Usage for RLHF (#2843) 2025-01-11 15:14:26 -08:00
Yineng Zhang
f624901cdd chore: bump v0.4.1.post5 (#2840) 2025-01-11 23:10:02 +08:00
Xiaoyu Zhang
f0e15dc6ab [HotFix] fix fp8 scale load failed in tp>1 (#2837) 2025-01-11 14:34:26 +08:00
Zhiqiang Xie
5d6e9467d4 Cache controller for hierarchical caching (#2804) 2025-01-10 20:22:01 -08:00
justdoit
a47bf39123 [Eagle2] Fix multiple concurrent request crashes (#2730) 2025-01-10 14:00:43 -08:00
TianYu GUO
b170646991 Fix port number overflow (#2826) 2025-01-10 13:44:32 -08:00
Muqi Li
5413ec2bbe [Bugfix] Fix bug in fork logic caused by null text_ (#2835) 2025-01-10 13:37:00 -08:00