Hongpeng Guo
|
e403d23757
|
[Feature] Add sampler custom logits processor (#2396)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
|
2025-01-19 14:46:53 -08:00 |
|
Enrique Shockwave
|
3bcf5ecea7
|
support regex in xgrammar backend (#2983)
|
2025-01-20 04:34:41 +08:00 |
|
Yineng Zhang
|
2c05f81f15
|
fix custom op version compatibility (#2988)
|
2025-01-20 04:21:29 +08:00 |
|
Seungduk Kim
|
d77caa2b75
|
[#2812] Make the decode status dict capacity adjustable by a CLI param (#2839)
|
2025-01-19 11:36:53 -08:00 |
|
giorgiopiatti-dfinity
|
8b6a4486ec
|
fix missing revision arg when loading tokenizer (#2982)
|
2025-01-19 11:36:07 -08:00 |
|
Yineng Zhang
|
6ada05d0ed
|
feat: check for is_cuda for sgl_kernel import (#2984)
|
2025-01-19 23:33:04 +08:00 |
|
yizhang2077
|
24cafe3177
|
add config to switch from vllm custom allreduce to sgl_kernel custom allreduce (#2981)
|
2025-01-19 22:30:38 +08:00 |
|
Yineng Zhang
|
5a176c92df
|
fix deepseek v2 with cpu device (#2975)
|
2025-01-19 21:33:27 +08:00 |
|
Lianmin Zheng
|
23196d5254
|
Simplify logits processor (#2974)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-18 23:03:49 -08:00 |
|
Lianmin Zheng
|
93b77c8e8a
|
Fix the request loggings to make it fully able to be easily replayed (#2973)
|
2025-01-18 21:45:00 -08:00 |
|
Lianmin Zheng
|
7906d1d298
|
Remove the unused write_with_records (#2972)
|
2025-01-18 20:20:23 -08:00 |
|
fzyzcjy
|
81d27c8e31
|
Refactor to add TypeBasedDispatcher to simplify dispatching (#2958)
|
2025-01-18 20:13:27 -08:00 |
|
Chang Su
|
4d4cdb3fe7
|
Frontend: better error message handling for FINISH_ABORT in scheduler.py (#2956)
|
2025-01-18 19:37:30 -08:00 |
|
Yang Zheng
|
2bd18e2d76
|
Memory pool: Minor optimize to avoid to (#2901)
|
2025-01-18 19:35:12 -08:00 |
|
Mick
|
3d93f84a00
|
[Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-01-18 14:14:19 -08:00 |
|
Yineng Zhang
|
2add697d7a
|
feat: remove vllm get_rope (#2964)
|
2025-01-18 19:38:01 +08:00 |
|
Ke Bao
|
656dcc1a99
|
Remove fp8 monkey patch (#2960)
|
2025-01-18 15:00:29 +08:00 |
|
Zhiqiang Xie
|
8af7048dcf
|
Query remaining memory dynamically for PrefillAdder (#2941)
|
2025-01-17 20:20:26 -08:00 |
|
bjmsong
|
d3024f4fc8
|
support e4m3 kvcache in qwen2 & add kv scaling factor json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
|
2025-01-18 11:43:22 +08:00 |
|
Yineng Zhang
|
78e5b22f29
|
feat: use get_rope for gemma2 (#2954)
|
2025-01-18 02:57:18 +08:00 |
|
Yineng Zhang
|
7a15e9ad36
|
cleanup models unused import 2/n (#2952)
|
2025-01-18 01:09:19 +08:00 |
|
Yineng Zhang
|
033c715b46
|
cleanup models dependencies 1/n (#2948)
|
2025-01-17 23:46:48 +08:00 |
|
Ke Bao
|
53e6552fed
|
Fix qwen accuracy issue (#2945)
|
2025-01-17 22:35:26 +08:00 |
|
Yineng Zhang
|
5dc54f1a62
|
feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
|
2025-01-17 22:31:51 +08:00 |
|
Chunyuan WU
|
63051738a9
|
Enable CPU device on SGLang (#2806)
|
2025-01-16 21:22:53 -08:00 |
|
Chang Su
|
a8ccacc8b8
|
[Frontend] Fix request length check and add option to disallow auto truncation in scheduler (#2876)
|
2025-01-16 14:51:19 -08:00 |
|
Lianmin Zheng
|
0427416b59
|
Fix zmq binding (#2930)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
|
2025-01-16 14:36:07 -08:00 |
|
Lianmin Zheng
|
bc6915e3b9
|
Improve type annotation and styles (#2926)
|
2025-01-16 12:51:11 -08:00 |
|
Lianmin Zheng
|
8b6ce52e92
|
Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
|
2025-01-16 11:15:00 -08:00 |
|
Lianmin Zheng
|
93d690617e
|
Simplify the process launch code in server.py (#2923)
|
2025-01-16 07:52:17 -08:00 |
|
Yun Dai
|
e00e5385e0
|
add profiling to bench_one_batch script (#2821)
|
2025-01-16 07:24:24 -08:00 |
|
Rin Intachuen
|
a2f602b541
|
fixed lm_head.weight error for quantized qwen (#2910)
|
2025-01-16 06:51:43 -08:00 |
|
Lianmin Zheng
|
8f2c522aba
|
Improve benchmark scripts and error message printing (#2922)
|
2025-01-16 06:24:31 -08:00 |
|
Yineng Zhang
|
bf8d07a6f9
|
feat: patch linear base (#2915)
|
2025-01-16 18:00:03 +08:00 |
|
yizhang2077
|
767c9dec03
|
adapt custom allreduce for tensorrt llm (#2511)
|
2025-01-16 04:57:35 +08:00 |
|
Lianmin Zheng
|
f65c13b559
|
Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902)
|
2025-01-15 04:54:14 -08:00 |
|
Cody Yu
|
b803b395b7
|
Disable graceful shutdown of tokenizer manager when not in the main thread (#2872)
|
2025-01-15 03:29:33 -08:00 |
|
Yineng Zhang
|
b3e99dfb22
|
chore: bump v0.4.1.post6 (#2899)
|
2025-01-15 16:23:42 +08:00 |
|
Yineng Zhang
|
f5c6c66794
|
feat: support internlm 3 dense (#2888)
|
2025-01-14 19:23:26 +08:00 |
|
Ke Bao
|
cc0485bef2
|
Support w8a8 int8 quantization config (#2881)
|
2025-01-14 17:07:49 +08:00 |
|
Ke Bao
|
c19d84829c
|
Adjust flashinfer workspace size for Qwen2 models (#2879)
|
2025-01-14 13:34:22 +08:00 |
|
Lianmin Zheng
|
46d4431889
|
Add a new api configure_logging to allow dumping the requests (#2875)
|
2025-01-13 14:24:00 -08:00 |
|
fzyzcjy
|
923f518337
|
CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)
|
2025-01-13 11:38:51 -08:00 |
|
Xiaoyu Zhang
|
d08c77c434
|
Sampling penalties memory interface (#2870)
|
2025-01-13 23:09:00 +08:00 |
|
Lianmin Zheng
|
c1e097ca66
|
Revert "Dump requests to a folder" (#2869)
|
2025-01-13 06:21:25 -08:00 |
|
Lzhang-hub
|
6ec75e626d
|
add qwen2 eagle model (#2863)
|
2025-01-13 05:29:33 -08:00 |
|
Lianmin Zheng
|
336ff5b9f5
|
Fix typos in io_struct.py (#2867)
|
2025-01-13 05:13:02 -08:00 |
|
Lianmin Zheng
|
3b141e1509
|
Dump requests (#2862)
|
2025-01-13 04:51:56 -08:00 |
|
Lianmin Zheng
|
6249e4a19e
|
Revert "Integration of TurboMind AWQ" (#2866)
|
2025-01-13 04:44:39 -08:00 |
|
Ke Bao
|
f3516c2894
|
Fix quant kernel accuracy issue (#2865)
|
2025-01-13 20:32:17 +08:00 |
|