Commit Graph

17 Commits

Author SHA1 Message Date
Xinyu Dong
070bfa4a73 [Bugfix] Fix Kunlun Graph failure (#193)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
2026-02-11 18:52:18 +08:00
WANG HAO
bd8c999335 Further optimize multi-LoRA inference; LoRA-enabled performance reaches 80%+ of non-LoRA performance (#190)
* optimize LoRA inference

Signed-off-by: wanghao <wanghao@example.com>

* further optimize multi-LoRA inference; LoRA-enabled performance reaches 80%+ of non-LoRA performance

Signed-off-by: wanghao <wanghao@example.com>

---------

Signed-off-by: wanghao <wanghao@example.com>
Co-authored-by: wanghao <wanghao@example.com>
2026-02-11 12:04:14 +08:00
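The optimization above targets LoRA-enabled serving. For reference, the low-rank update LoRA applies at each adapted layer can be sketched in plain Python; this is a generic illustration of the math, not the plugin's optimized kernel, and all names here are hypothetical:

```python
# Minimal sketch of the LoRA low-rank update: y = W x + scale * B (A x).
# Generic illustration only -- not the Kunlun plugin's optimized path.

def matvec(M, x):
    """Dense matrix-vector product over plain lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """Apply the frozen base weight W plus the rank-r LoRA delta B @ A."""
    base = matvec(W, x)               # W x  -- base projection
    delta = matvec(B, matvec(A, x))   # B (A x) -- low-rank correction
    return [b + scale * d for b, d in zip(base, delta)]

# Tiny example: 2x2 identity base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # r x in_features  (r = 1)
B = [[0.5], [0.5]]      # out_features x r
print(lora_forward(W, A, B, [2.0, 4.0]))  # -> [5.0, 7.0]
```

Because the delta costs only two skinny matmuls, a well-fused kernel can keep LoRA overhead small relative to the base-weight GEMM, which is what makes the 80%+ figure plausible.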
Li Wei
71bd70ad6c [Feature] support compressed-tensors w4a16 quantization (#154)
- native INT4 Kimi model inference is supported

Signed-off-by: Li Wei <liwei.109@outlook.com>
2026-01-27 19:56:22 +08:00
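The "w4" in w4a16 means weights are stored as 4-bit integers with a per-row scale while activations stay in 16-bit. A minimal sketch of symmetric int4 weight quantization, illustrative only (real compressed-tensors kernels pack two int4 values per byte and use grouped scales):

```python
# Sketch of symmetric 4-bit weight quantization (the "w4" in w4a16):
# each weight row is stored as ints in [-8, 7] plus one float scale,
# and dequantized back to high precision at matmul time.

def quantize_row(row):
    """Return (int4 values, scale) for one weight row, symmetric around 0."""
    scale = max(abs(w) for w in row) / 7.0 or 1.0  # avoid div-by-zero rows
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int4 values and the scale."""
    return [v * scale for v in q]

row = [0.7, -0.28, 0.07, 0.0]
q, s = quantize_row(row)
print(q)                     # ints in [-8, 7]
print(dequantize_row(q, s))  # close to the original row
```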
Shiwen Tang
0711c1abfa [Feature] Support AWQ MoE W4A16 Quantization (#142)
Signed-off-by: tangshiwen <tangshiwen@baidu.com>
Co-authored-by: Li Wei <liwei.109@outlook.com>
2026-01-26 18:56:05 +08:00
fromck
0ce5f1a3f7 Add kernels to optimize RoPE and the decoding stage (#143)
Co-authored-by: chengxiaokang <chengxiaokang@baidu.com>
2026-01-23 10:29:52 +08:00
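RoPE rotates each pair of query/key channels by a position-dependent angle; the commit above fuses this into a kernel, while the plain-Python sketch below only illustrates the underlying math:

```python
import math

# Reference-style sketch of rotary position embedding (RoPE): each channel
# pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i, where
# theta_i = base^(-2i/d). Illustrative only; the real kernel fuses this.

def rope(x, pos, base=10000.0):
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out

# Position 0 applies zero rotation; rotation always preserves the norm.
print(rope([1.0, 0.0, 1.0, 0.0], pos=0))  # -> [1.0, 0.0, 1.0, 0.0]
vec = rope([1.0, 0.0, 1.0, 0.0], pos=3)
print(math.isclose(sum(v * v for v in vec), 2.0))  # True
```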
fromck
71a5a04e0a [Misc] Specify that DS32 only supports --kv-cache-dtype bfloat16 (#119)
* [Kernel] add kernels to torch.ops

* [Misc] Specify that DS only supports --kv-cache-dtype bfloat16

---------

Co-authored-by: chengxiaokang <chengxiaokang@baidu.com>
2026-01-17 16:52:02 +08:00
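The constraint above is enforced at launch time. A hypothetical invocation (the model path is a placeholder; the flag name and value are taken from the commit title):

```shell
# Hypothetical launch command for a DeepSeek model on Kunlun.
# Per the commit above, only bfloat16 is accepted for the KV cache dtype.
vllm serve /path/to/deepseek-model --kv-cache-dtype bfloat16
```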
roger-lcc
37cc307322 register apply_repetition_penalties_ in custom_op (#110)
* fix qwen2_vl for 0.11.0

* register apply_repetition_penalties_ in custom_op

---------

Co-authored-by: luochencheng <luochencheng@baidu.com>
2026-01-13 20:22:14 +08:00
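For context, the semantics behind `apply_repetition_penalties_` can be sketched in plain Python under the standard repetition-penalty formulation: positive logits of already-generated tokens are divided by the penalty and negative ones multiplied by it. The trailing underscore marks the real op as in-place; this mirror is illustrative only:

```python
# Sketch of repetition-penalty semantics: tokens already seen in the output
# get their logits pushed down (positive / penalty, negative * penalty),
# discouraging re-generation. Mutates `logits` in place, like the real op.

def apply_repetition_penalties_(logits, seen_token_ids, penalty):
    for t in seen_token_ids:
        if logits[t] > 0:
            logits[t] = logits[t] / penalty
        else:
            logits[t] = logits[t] * penalty
    return logits

logits = [2.0, -1.0, 0.5, 3.0]
apply_repetition_penalties_(logits, seen_token_ids={0, 1}, penalty=2.0)
print(logits)  # -> [1.0, -2.0, 0.5, 3.0]
```

Registering it as a custom op lets the runtime dispatch to a device-specific kernel instead of this eager loop.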
baoqian426
2c9b176e6e [Feature] use for dp (#90) 2026-01-08 11:05:48 +08:00
tangshiwen
f811ae968a [fix] resolve cutlass_scaled_mm inference error 2026-01-06 20:52:12 +08:00
Li Wei
9533f68e99 [fix] matmul does not support CUDA graph 2026-01-06 17:32:45 +08:00
Li Wei
515a4eeda9 [dev] support compressed-tensors w8a8 quantization (#75)
* [dev] support compressed-tensors w8a8 quantization

Co-authored-by: Li Wei <liwei.109@outlook.com>

* [refactor] update KunlunScaleMMKernel impl

* [rebase] resolve conflicts and remove redundant code

---------

Co-authored-by: tangshiwen <tangshiwen@baidu.com>
2026-01-06 13:51:53 +08:00
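Unlike w4a16 above, w8a8 quantizes both activations and weights to int8, accumulates the GEMM in int32, and applies one float rescale at the end. A sketch of that scaled-matmul idea, illustrative only (`KunlunScaleMMKernel` names the real kernel; the helper names here are hypothetical):

```python
# Sketch of the w8a8 scaled-matmul idea: int8 activations x int8 weights,
# exact integer accumulation, then one rescale (act_scale * weight_scale)
# recovers the real-valued product.

def quantize(vec, scale):
    """Quantize floats to the int8 range [-127, 127] with a given scale."""
    return [max(-127, min(127, round(v / scale))) for v in vec]

def scaled_dot(a_q, w_q, a_scale, w_scale):
    """int8 dot product accumulated exactly, then rescaled to float."""
    acc = sum(a * w for a, w in zip(a_q, w_q))  # int32-style accumulator
    return acc * a_scale * w_scale

a = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -0.5]
a_q = quantize(a, 1.0 / 127)   # scale chosen so |a| <= 1 fits int8
w_q = quantize(w, 1.0 / 127)
approx = scaled_dot(a_q, w_q, 1.0 / 127, 1.0 / 127)
exact = sum(x * y for x, y in zip(a, w))  # -0.125
print(abs(approx - exact) < 1e-2)  # True: close to the float result
```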
baoqian426
ee0f50e68f [Feature] support deepseek v3/r1/v3.2 (#78)
* [Feature] support deepseek v3/r1/v3.2

* fix gpt_oss

* update readme

* update readme

---------

Co-authored-by: hanhaowen <hanhaowen@baidu.com>
2026-01-05 22:55:35 +08:00
hanhaowen
b015bb76fd remove qwen2.py and llama.py; fix llama output 2025-12-31 11:39:37 +08:00
Xinyu Dong
b3c30a3cb9 [Feature] Support XiaoMi MIMO Flash V2 (#62)
* [Feature] Support MIMO Flash V2
2025-12-31 10:16:33 +08:00
Li Wei
6546323c71 [dev] support AWQ/GPTQ quantization for dense models 2025-12-24 13:46:06 +08:00
chenyili
7c22d621fb Commit the vLLM 0.11.0 development branch 2025-12-10 17:51:24 +08:00
dongxinyu03
c728e52505 Initial commit for vLLM-Kunlun Plugin 2025-12-10 12:05:39 +08:00