Lianmin Zheng
|
665815969a
|
Enable cuda graph by default (#612)
|
2024-07-13 05:29:46 -07:00 |
|
Lianmin Zheng
|
396a69240f
|
Cleanup attention backend: flashinfer and triton (#611)
|
2024-07-12 18:21:11 -07:00 |
|
Lianmin Zheng
|
af4e7910e7
|
Clean up the usage of flashinfer (#610)
|
2024-07-12 13:00:03 -07:00 |
|
Lianmin Zheng
|
519e20cfda
|
Code clean up: Remove deprecated prefill; move InputMetadata to infer_batch.py (#609)
|
2024-07-12 12:28:09 -07:00 |
|
Lianmin Zheng
|
d9a6902986
|
Fix bench latency (#607)
|
2024-07-11 14:37:01 -07:00 |
|
Lianmin Zheng
|
ad872feb14
|
bump version to 0.1.19
|
2024-07-09 02:23:14 -07:00 |
|
Lianmin Zheng
|
da2e5d6546
|
Fix the default argument of OpenAI Chat completion (#605)
|
2024-07-09 02:04:43 -07:00 |
|
胡译文
|
02b7258658
|
[Feat] Expose logprob options to sgl.gen API (#503)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2024-07-09 00:35:39 -07:00 |
|
prophe
|
d557e9f3b7
|
Update chat template for qwen and yi-1.5. (#530)
|
2024-07-08 23:55:44 -07:00 |
|
Tommy Yang
|
740c46a152
|
Add Qwen2 MoE support (#603)
|
2024-07-08 23:44:59 -07:00 |
|
Tommy Yang
|
b38687226a
|
Make sglang compat with vllm 0.5.1 (#598)
|
2024-07-08 23:44:22 -07:00 |
|
Pan Lyu
|
710f614ebe
|
add minicpm support (#602)
|
2024-07-08 23:27:04 -07:00 |
|
Liangsheng Yin
|
f25b76c02a
|
add LogitsMetadata (#604)
|
2024-07-08 17:46:55 -07:00 |
|
Mingyi
|
f4e885b7c3
|
Reduce number of workspaces (#601)
|
2024-07-07 19:35:22 -07:00 |
|
Liangsheng Yin
|
0877f1e75b
|
Fix streaming (#600)
|
2024-07-07 01:55:58 -07:00 |
|
Liangsheng Yin
|
5304b4ef58
|
Add --enable-p2p-check option (#599)
|
2024-07-06 23:34:10 -07:00 |
|
Pan Lyu
|
26908d9568
|
* fix(detokenizer_manager.py): fix truncated decoded output (#586)
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
|
2024-07-06 14:53:22 -07:00 |
|
Mingyi
|
c0982ac553
|
Fix Llava model (#594)
|
2024-07-06 00:58:46 -07:00 |
|
Ying Sheng
|
dc1b8bcfaa
|
Format (#593)
|
2024-07-05 10:06:17 -07:00 |
|
Ying Sheng
|
5a57b8addd
|
Add Gemma2 (#592)
|
2024-07-05 09:48:54 -07:00 |
|
Ying Sheng
|
2f11936f95
|
bump version to 0.1.18
|
2024-07-04 06:27:29 +00:00 |
|
Lianmin Zheng
|
63fbef9876
|
fix flashinfer & http log level
|
2024-07-03 23:19:33 -07:00 |
|
Ying Sheng
|
2a754e57b0
|
2x performance improvement for large prefill & Fix workspace conflicts (#579)
|
2024-07-03 16:14:57 -07:00 |
|
Liangsheng Yin
|
96c503eb60
|
fix the broken server args (#585)
|
2024-07-03 16:01:19 -07:00 |
|
Chen Xuechen Li
|
441cca773d
|
support gptj style rope in llama
|
2024-07-03 22:06:58 +00:00 |
|
Lianmin Zheng
|
c7709d3abe
|
Update install commands (#583)
|
2024-07-03 02:10:59 -07:00 |
|
Ying Sheng
|
9380f50ff9
|
Turn on flashinfer by default (#578)
|
2024-07-02 02:25:07 -07:00 |
|
Daniel Hernandez Garcia
|
95dc093b19
|
[BugFix] gemma loading weights "lm_head.weight" key error (#577)
|
2024-07-01 22:10:07 -07:00 |
|
Yueyang Pan
|
d9ac639202
|
Fix flashinfer version (#576)
|
2024-07-01 22:08:39 -07:00 |
|
Ying Sheng
|
75b31a2a88
|
Update run_batch interface and max_prefill_tokens (#574)
|
2024-06-30 18:26:04 -07:00 |
|
sglang
|
11616fc6bd
|
Minor fix in compiler & format (#545)
|
2024-06-29 23:42:14 -07:00 |
|
Ying Sheng
|
9ce89bc14b
|
Update benchmark script (#571)
|
2024-06-28 00:44:22 -07:00 |
|
Lianmin Zheng
|
badf3fa020
|
Expose dtype argument (#569)
|
2024-06-27 23:30:39 -07:00 |
|
Lianmin Zheng
|
2e6e62e156
|
Increase the number of thread limitation for tp worker managers. (#567)
|
2024-06-26 09:33:45 -07:00 |
|
Lianmin Zheng
|
a385ee27bd
|
Warmup cublas (#566)
|
2024-06-25 12:46:00 -07:00 |
|
Lianmin Zheng
|
eb1ae6ae0c
|
Add sglang.bench_latency for offline benchmark (#564)
|
2024-06-25 03:38:04 -07:00 |
|
Lianmin Zheng
|
2187f36237
|
Add a new argument log_level_http to control the HTTP logging (#563)
|
2024-06-25 01:16:20 -07:00 |
|
Lianmin Zheng
|
9465b668b9
|
Allow running with vllm==0.4.3 (#561)
|
2024-06-24 15:24:21 -07:00 |
|
Lianmin Zheng
|
1fa15099d8
|
Add LlamaForClassification (#559)
|
2024-06-22 00:49:31 -07:00 |
|
Lianmin Zheng
|
303ef8883e
|
Clean up logits processor (#558)
|
2024-06-22 00:25:24 -07:00 |
|
Lianmin Zheng
|
e94e60d6fb
|
make flashinfer workspace larger
|
2024-06-21 17:32:36 -07:00 |
|
Lianmin Zheng
|
d2f8bfb2e1
|
Follow-up fixes for flashinfer 0.0.5 (#556)
|
2024-06-20 23:19:52 -07:00 |
|
Lianmin Zheng
|
b7e2f800ac
|
Update flashinfer to 0.0.5 (#554)
|
2024-06-20 20:29:06 -07:00 |
|
Ying Sheng
|
09593e9bc9
|
Multi-node Tensor Parallelism (#550)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2024-06-17 20:41:24 -07:00 |
|
Lianmin Zheng
|
53a7ebd89a
|
Update fused_moe (#553)
|
2024-06-17 09:47:58 -07:00 |
|
Liangsheng Yin
|
ad5f04d6ce
|
Fix the Jump-Forward with Chinese (#551)
|
2024-06-16 21:45:04 +08:00 |
|
Qubitium-modelcloud
|
bbec01c9aa
|
Fix tp worker only checking req[0] for stream (#546)
|
2024-06-14 22:56:10 -07:00 |
|
Ying Sheng
|
fb9296f0ed
|
Higher priority for user input of max_prefill_tokens & format (#540)
|
2024-06-12 21:48:40 -07:00 |
|
Ying Sheng
|
1374334d38
|
Fix dependency & crash issues (#539)
|
2024-06-12 21:23:19 -07:00 |
|
Lianmin Zheng
|
94aead9e8d
|
Fix dependency (#538)
|
2024-06-12 13:17:35 -07:00 |
|