Commit Graph

  • 20b765a26e Model: Support Qwen 72B RM model. (#3772) simveit 2025-02-21 23:38:21 +01:00
  • e310722266 Docs: Update offline_engine_api and add links (#3773) Chayenne 2025-02-21 14:15:52 -08:00
  • e074e76b31 docs: Add offline engine launch example and documentation (#3771) Shi Shuai 2025-02-21 19:25:52 +00:00
  • 4592afc27d Docs: Fix layout to docs (#3733) simveit 2025-02-21 20:24:13 +01:00
  • 9af0e21ef5 [bug] fixed batch api for DeepSeek V3/R1 (#3754) Shenggui Li 2025-02-22 02:28:16 +08:00
  • c7c79b16cd [Fix] OpenAI API adapter tokenizer encoding (#3432) Shi Shuai 2025-02-21 17:24:15 +00:00
  • d8d75d256a Change description of nvidia jetson docs (#3761) Shakhizat Nurgaliyev 2025-02-21 17:44:22 +05:00
  • 1df6eabd5d feat: Add SageMaker support (#3740) Andrew Smith 2025-02-21 22:31:09 +11:00
  • 0c227ee373 feat: update grouped_topk to support softmax and sigmoid (#3680) zixuanzhang226 2025-02-21 00:30:15 -08:00
  • 5c54ef0352 AMD/ROCm: update AITER repo to ROCm/aiter (#3747) HAI 2025-02-21 00:18:08 -08:00
  • c6a4852136 [docs] added torch.compile cache to dpsk manual (#3737) Shenggui Li 2025-02-21 16:11:40 +08:00
  • 4f678c87d8 docs: fix 404 link (#3588) Trayvon Pan 2025-02-21 07:15:57 +08:00
  • e79f7420be [Fix] Fix bugs and refactor codes in lora for better scalability. (#3652) aoshen524 2025-02-20 14:51:57 -05:00
  • ac05310098 [Docs] Modify ep related server args and remove cublas part of deepseek (#3732) Baizhou Zhang 2025-02-20 11:37:56 -08:00
  • d5d80ab477 [Bugfix] Fix scores mask for moe topk (#3705) chenxiaobing 2025-02-21 02:17:23 +08:00
  • ddcf9fe3be Optimize triton attention custom mask (#3731) Ke Bao 2025-02-21 00:54:41 +08:00
  • 6252ade985 revert BLOCK and num_warps on HIP (#3722) HAI 2025-02-20 07:30:18 -08:00
  • 1eb8eade2b add control for cutlass fp8 blockwise gemm (#3727) yizhang2077 2025-02-20 16:10:35 +08:00
  • 3c7bfd7eab Docs: Fix layout with sub-section (#3710) Chayenne 2025-02-19 15:44:30 -08:00
  • bb121214c2 Variance measure for reasoning benchmark (#3677) simveit 2025-02-19 20:49:49 +01:00
  • 55de40f782 [Docs]: Fix Multi-User Port Allocation Conflicts (#3601) Shi Shuai 2025-02-19 19:15:44 +00:00
  • 6b0aeb58fd [moe] optim: reduce memory consumption in fused_moe (#3692) Cheng Wan 2025-02-19 13:25:05 -05:00
  • bb3e526823 [k8s] remove unnecessary hostIPC for security concern (#3700) Peter Pan 2025-02-20 02:11:21 +08:00
  • f93e915817 [Docs] Add SkyPilot DeepSeek example (#3706) Zhanghao Wu 2025-02-19 10:10:23 -08:00
  • 55a7ec388f use warp shuffle style reduce and flashinfer vectorize (#3628) Xiaoyu Zhang 2025-02-19 20:53:51 +08:00
  • fe0673f1cc set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed (#3698) Yineng Zhang 2025-02-19 20:50:22 +08:00
  • 99c1b9d2ee fix: apply cache size limit of attention mask for VisionAttention (#3657) Mick 2025-02-19 20:16:48 +08:00
  • 634a3561ac AMD Prefill optimize (#3665) who who who 2025-02-19 01:35:58 +08:00
  • 424848d26f fix: remove dependency on latest transformers impl (#3635) Mick 2025-02-19 01:14:11 +08:00
  • e5ce395a6c Fix draft decode max batch size (#3676) Ke Bao 2025-02-18 23:03:26 +08:00
  • f983213a1f update pr-test (#3663) Yineng Zhang 2025-02-18 17:23:43 +08:00
  • 67fc595bb8 [Feature] Apply Cublas Grouped Gemm kernel (#3629) Baizhou Zhang 2025-02-17 23:18:31 -08:00
  • 07ab4d4a2d fix #3654 Yineng Zhang 2025-02-18 15:16:16 +08:00
  • 522e18eaeb Update amd docker image. (#3654) saienduri 2025-02-17 20:12:55 -08:00
  • c51dc2cc8d Docs: Deploy multi-node inference (LWS method) using sglang in a K8s cluster (#3624) ybyang 2025-02-18 10:14:20 +08:00
  • ddf39d3fce [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567) yigex 2025-02-18 09:54:10 +08:00
  • 2eab113206 [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. (#3616) Wen-Heng (Jack) Chung 2025-02-17 17:54:18 -06:00
  • 058d199d4e use transformers 4.48.3 (#3650) Yineng Zhang 2025-02-18 04:40:47 +08:00
  • a5375adc3a chore: bump v0.4.3.post2 (#3645) Yineng Zhang 2025-02-18 02:48:30 +08:00
  • 75d171a9c5 chore: update flashinfer v0.2.1.post2 (#3644) Yineng Zhang 2025-02-18 02:47:42 +08:00
  • 714f3e6362 feat: support flashinfer mla with prefix cache (#3643) Yineng Zhang 2025-02-18 02:06:43 +08:00
  • c38f3aed24 support multi-gpu block-gemm tuning (#3639) Xiaoyu Zhang 2025-02-18 00:00:35 +08:00
  • 2e6be53e7d fix Dockerfile.rocm Yineng Zhang 2025-02-17 22:13:03 +08:00
  • e782eb7e6a chore: bump v0.4.3.post1 (#3638) Yineng Zhang 2025-02-17 21:58:19 +08:00
  • e319153be8 update unit test (#3636) Yineng Zhang 2025-02-17 21:06:10 +08:00
  • 32b44d2fca add mtp unit test (#3634) Yineng Zhang 2025-02-17 19:04:07 +08:00
  • 5f1a485d9e Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… (#3632) Yineng Zhang 2025-02-17 18:01:21 +08:00
  • c9565e49e7 [docker] added rdma support (#3619) Shenggui Li 2025-02-17 15:36:16 +08:00
  • d03c4c25a7 [docs] Update sampling_params.md (#3617) Shi Shuai 2025-02-17 02:52:30 +00:00
  • 8f13377dea Draft of updated doc for sampling params. (#3260) simveit 2025-02-16 23:28:22 +01:00
  • 3d4a8f9bc0 Benchmark for reasoning models (#3532) simveit 2025-02-16 20:07:30 +01:00
  • 7474bed883 Update to latest amd image. (#3597) saienduri 2025-02-16 08:29:47 -08:00
  • 03caefeb51 [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. (#3535) Wen-Heng (Jack) Chung 2025-02-16 03:40:38 -06:00
  • bcc213df61 Model: Support Qwen 2.5 vl (#3258) Mick 2025-02-16 16:58:53 +08:00
  • 39416e394a fix lockfile and port_registry file permission error (#3598) Jiada Li 2025-02-15 22:14:45 -05:00
  • 231c40d859 [docs] added favicon to sphinx html (#3564) Shenggui Li 2025-02-16 02:21:21 +08:00
  • bbc47c348f fix apply_token_bitmask_inplace_cuda (#3594) Yineng Zhang 2025-02-15 23:55:08 +08:00
  • dfce926921 fix high qps crash when enable mtp (#3592) Yineng Zhang 2025-02-15 23:11:28 +08:00
  • 6718b10996 fix eagle unit test (#3591) Yineng Zhang 2025-02-15 23:10:48 +08:00
  • 7711ac6ed0 doc: emphasize and notify the usage of chat_template (#3589) Mick 2025-02-15 16:10:32 +08:00
  • 7443197a63 [CI] Improve Docs CI Efficiency (#3587) Shi Shuai 2025-02-15 03:57:00 +00:00
  • 862dd76c76 Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 (#3582) Ke Bao 2025-02-15 05:28:34 +08:00
  • fb4c9c3a30 [fix] added support for vlm in offline inference (#3548) Shenggui Li 2025-02-15 05:27:29 +08:00
  • d973c78e79 ROCm docker: triton update (#3584) HAI 2025-02-14 10:26:32 -08:00
  • 6ce6eabbcc Copy config files for MI300X to support in virtualized environments (#3505) Jesse Lopez 2025-02-14 09:23:32 -08:00
  • 4e23c961e8 docs: update install (#3581) Yineng Zhang 2025-02-14 18:54:50 +08:00
  • 3efbdf68b9 fix sgl-kernel codestyle (#3563) Xiaoyu Zhang 2025-02-14 18:05:52 +08:00
  • 6cc309557a Add support for OpenAI API o1 model (#3363) Chuyue Sun 2025-02-13 19:43:00 -08:00
  • 31eec35ba8 fix doc (#3558) Yineng Zhang 2025-02-14 10:11:31 +08:00
  • ac963be234 update flashinfer-python (#3557) Yineng Zhang 2025-02-14 09:52:56 +08:00
  • e0b9a423c8 chore: bump v0.4.3 (#3556) Yineng Zhang 2025-02-14 09:43:14 +08:00
  • e082142519 chore: bump 0.0.3.post6 sgl-kernel (#3555) Yineng Zhang 2025-02-14 08:55:15 +08:00
  • 70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) Yineng Zhang 2025-02-14 08:50:14 +08:00
  • 368de3661e Update install docs (#3553) simveit 2025-02-13 22:42:51 +01:00
  • 20de05a753 update README (#3543) Yineng Zhang 2025-02-13 17:22:11 +08:00
  • f076328bb7 fix moe_align_kernel shm init not sync bug (#3534) Xiaoyu Zhang 2025-02-13 16:47:00 +08:00
  • bf2a70872e Update DeepSeek V3 Doc (#3541) Jhin 2025-02-13 01:15:37 -06:00
  • 871a4aa1bf [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. (#3536) Wen-Heng (Jack) Chung 2025-02-12 22:09:36 -06:00
  • 98eecbda54 integrate blockwise fp8 kernel (#3529) yizhang2077 2025-02-13 04:39:33 +08:00
  • 4430c0a513 chore: bump 0.0.3.post5 sgl-kernel (#3530) Yineng Zhang 2025-02-13 01:51:46 +08:00
  • 640363ad20 support blockwise fp8 matmul kernel (#3267) yizhang2077 2025-02-13 01:49:33 +08:00
  • 8616357a97 Fix deepseek awq v3 (#3450) Liangsheng Yin 2025-02-12 22:09:52 +08:00
  • 8adbc78b30 added llama and cleaned up (#3503) Zachary Streeter 2025-02-12 04:48:30 -06:00
  • 45e3a7bc41 use sgl_per_token_group_quant_fp8 kernel (#3493) Xiaoyu Zhang 2025-02-12 18:40:42 +08:00
  • b96e92e6e6 chore: bump 0.0.3.post4 sgl-kernel (#3523) Yineng Zhang 2025-02-12 17:28:36 +08:00
  • 693c2600e0 refine deepseek_v3 launch server doc (#3522) Xiaoyu Zhang 2025-02-12 17:27:07 +08:00
  • ced680663c doc: Support a new vLM (#3405) Mick 2025-02-12 16:43:14 +08:00
  • b8318aec48 Make NCCL NVLS configurable (#3502) Ata Fatahi 2025-02-11 14:25:06 -05:00
  • 2f48221033 docs: update install Yineng Zhang 2025-02-12 03:13:31 +08:00
  • d81ac4434e MI30x: More graph captures for larger batch sizes and concurrencies (#3420) HAI 2025-02-11 11:04:38 -08:00
  • 2491cc928d add deepseek-v3 amd docker command (#3495) Zachary Streeter 2025-02-11 13:03:08 -06:00
  • 67c5de9286 fix router typo (#3496) Didier Durand 2025-02-11 20:00:57 +01:00
  • 1e2cf2b541 fix server_arguments typo (#3499) Didier Durand 2025-02-11 19:59:53 +01:00
  • 9490d15772 fix supported_models Qwen typo (#3498) Didier Durand 2025-02-11 19:59:18 +01:00
  • eefcbdd353 fix deepseek_v3 typo (#3497) Didier Durand 2025-02-11 19:58:36 +01:00
  • 7e6d5fc694 Support Eagle cuda graph for Triton backend (#3500) Ke Bao 2025-02-12 02:27:45 +08:00
  • cadd5dbe6a Tune MI300X fused MoE Triton kernel JSON config. (#3492) Wen-Heng (Jack) Chung 2025-02-11 12:27:25 -06:00
  • bb418ced80 optimize per token group quant fp8 (#3490) Xiaoyu Zhang 2025-02-11 22:19:05 +08:00
  • fdf04a1426 [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418) yigex 2025-02-11 15:55:04 +08:00
  • 5f0e7de339 [Feat] Return hidden states (experimental) (#3364) Jackmin801 2025-02-10 15:54:37 -08:00