Commit Graph

  • 2f47d710ae refine some typo (#3473) Xiaoyu Zhang 2025-02-10 23:35:44 +08:00
  • 4fe92bfca5 fix mla test (#3469) Yineng Zhang 2025-02-10 21:12:00 +08:00
  • d23cb9a01e [Eagle] reduce one draft forward (#3468) Ying Sheng 2025-02-10 04:21:49 -08:00
  • 2d61132374 Support Eagle2 for Triton backend (#3466) Ke Bao 2025-02-10 20:00:42 +08:00
  • cddb1cdf8f chore: bump v0.4.2.post4 (#3459) Yineng Zhang 2025-02-10 14:12:16 +08:00
  • fa1b40e00d use nvcr.io/nvidia/tritonserver:24.04-py3-min as base image (#3457) Yineng Zhang 2025-02-10 13:52:33 +08:00
  • c45cab1c00 [Fix] Fix accuracy bug and refactor codes for lora (#3413) Baizhou Zhang 2025-02-09 21:29:00 -08:00
  • 27c4c9cf52 remove _grouped_size_compiled_for_decode_kernels (#3453) Yineng Zhang 2025-02-10 13:01:21 +08:00
  • 52a492a16e Update contribution_guide.md (#3452) Ying Sheng 2025-02-09 20:53:47 -08:00
  • 36f6fc5093 feat: enable ragged fa3 by default on hopper 12.4+ (#3442) Yineng Zhang 2025-02-10 07:43:01 +08:00
  • d87272750b fix ci (#3441) Yineng Zhang 2025-02-10 04:22:28 +08:00
  • 6239d0b2e7 chore: bump sgl-kernel v0.0.3.post3 (#3440) Yineng Zhang 2025-02-10 04:00:52 +08:00
  • 4cfd3add6d support version in sgl-kernel (#3439) Yineng Zhang 2025-02-10 03:49:52 +08:00
  • 20cf910d8f [docs] Update quantization documentation (#3437) Shi Shuai 2025-02-09 18:39:49 +00:00
  • 0af1d239cb [Docs] Add quantization docs (#3410) Wenxuan Tan 2025-02-09 12:16:21 -06:00
  • 85986bb978 compatible with new outlines (#3435) Yineng Zhang 2025-02-10 01:51:30 +08:00
  • 64c8713573 remove activation dependency in fused_moe (#3433) Yineng Zhang 2025-02-10 01:18:57 +08:00
  • 1646149a83 fix draft cuda graph capture failure (#3431) Yineng Zhang 2025-02-09 23:16:20 +08:00
  • bc72e5bd32 add cuda graph capture failure possible solution (#3430) Yineng Zhang 2025-02-09 22:57:11 +08:00
  • 014cab4dd2 update forward_return_lse (#3425) Yineng Zhang 2025-02-09 20:18:44 +08:00
  • 4d2dbeaca7 remove cutex dependency (#3422) Yineng Zhang 2025-02-09 18:33:20 +08:00
  • 29daf498cd fix cu118 link issue (#3421) Yineng Zhang 2025-02-09 18:16:44 +08:00
  • 6702592d0e [docs] Add multi-node inference example for SLURM in documentation (#3408) Shi Shuai 2025-02-09 05:45:14 +00:00
  • 60abdb3e7c minor: cleanup test_eagle_infer (#3415) Yineng Zhang 2025-02-09 09:34:30 +08:00
  • 7b4e61fff3 [Fix] Fix eagle with disable cuda graph (#3411) Ying Sheng 2025-02-08 16:40:00 -08:00
  • 6222e1c228 add disable cuda graph unit test for eagle 2 (#3412) Yineng Zhang 2025-02-09 08:02:56 +08:00
  • fad315cb8e fix EAGLE 2 non greedy case (#3407) Yineng Zhang 2025-02-09 07:28:34 +08:00
  • f90db8bc07 fix typo Yineng Zhang 2025-02-08 22:16:42 +08:00
  • d8ad597048 Add deepseek-v3 a100 serving example (#3404) Ke Bao 2025-02-08 22:13:52 +08:00
  • 849f58d617 Update fused_moe's benchmark (#3346) GaoYuYang 2025-02-08 21:58:21 +08:00
  • 64480df495 [BUG] fix moe benchmark when bs*seq is small (#3382) yiakwy-xpu-ml-framework-team 2025-02-08 15:39:44 +08:00
  • 4530136e61 Add H20 fp8 w8a8 gemm config (#3386) lukec 2025-02-08 15:36:31 +08:00
  • 0a6f18f068 added amd_configure.md to references (#3275) Zachary Streeter 2025-02-07 10:50:49 -06:00
  • c1f5f99f60 chore: bump v0.4.2.post3 (#3369) Yineng Zhang 2025-02-08 00:20:03 +08:00
  • fa82dfccdd fix EagleVerifyInput (#3378) Yineng Zhang 2025-02-07 22:30:43 +08:00
  • 5da3d21c8b update pr-test ci (#3376) Yineng Zhang 2025-02-07 21:08:35 +08:00
  • f287037673 update sgl-kernel version (#3374) Yineng Zhang 2025-02-07 20:51:06 +08:00
  • f9905d59a8 support speculative decoding kernel in sgl-kernel (#3373) Yineng Zhang 2025-02-07 20:29:51 +08:00
  • 45c87e083f fix undefined symbol cudaGetDriverEntryPointByVersion (#3372) Yineng Zhang 2025-02-07 19:32:45 +08:00
  • 2b1808cec4 update unit test in AMD CI (#3366) Yineng Zhang 2025-02-07 17:25:16 +08:00
  • e868d0b60e update waves_per_eu to 1 (#3356) lizamd 2025-02-06 21:08:06 -08:00
  • 591e751e07 Fix: Runtime error for function calling (#3300) Shi Shuai 2025-02-07 04:52:01 +00:00
  • 40022d075a Feature: Fix the binding error in Llama (#3355) Chayenne 2025-02-06 20:19:24 -08:00
  • 823148e7f0 Docs: Add deepseek usage and add multi-node, credit to lycanlancelot (#3314) Liangjun Song 2025-02-07 14:45:00 +11:00
  • 76ca91dff2 Docs/CI: Enable Fake Finish for Docs Only PR (#3350) Chayenne 2025-02-06 19:33:31 -08:00
  • cdae77b03d optimize moe_align_kernel cuda (#3347) Xiaoyu Zhang 2025-02-07 00:53:46 +08:00
  • adeee15204 fix sgl-kernel build failure on AMD (#3352) Yineng Zhang 2025-02-07 00:35:59 +08:00
  • 6792411e7f [Doc] Add optimization option guide for deepseek v3 (#3349) Ke Bao 2025-02-06 23:28:09 +08:00
  • 7348d9627e add AMD guide for DeepSeek-R1 (#3338) Yineng Zhang 2025-02-06 16:54:40 +08:00
  • 25ed22b685 update pull request template (#3337) Yineng Zhang 2025-02-06 16:48:02 +08:00
  • 200d3b1608 Add sgl-kernel to MI300 CI paths tested. (#3335) saienduri 2025-02-06 00:45:38 -08:00
  • ad3499858e clean moe align block kernel code and add acc test (#3332) Xiaoyu Zhang 2025-02-06 16:42:36 +08:00
  • 32de54ed1a [ROCm] Fix fp8 unrolledx4 matmul kernel. (#3325) Wen-Heng (Jack) Chung 2025-02-05 20:44:15 -06:00
  • 2d9c319594 Docker switch (#3327) saienduri 2025-02-05 18:06:50 -08:00
  • 07e58a2dcb update README (#3324) Yineng Zhang 2025-02-06 07:13:05 +08:00
  • 04d8cd2088 Initial Enablement of CI on MI300 (#3168) saienduri 2025-02-05 10:45:12 -08:00
  • a322051e31 Support custom mask for Triton attention (#3317) Ke Bao 2025-02-06 01:16:02 +08:00
  • de5533341e Update Triton extend backend interface (#3309) Ke Bao 2025-02-05 18:12:22 +08:00
  • 7aad8d1854 chore: bump v0.4.2.post2 (#3313) Yineng Zhang 2025-02-05 17:35:02 +08:00
  • 76fa2d152c Fix lora flashinfer import bug on ROCM (#3312) Baizhou Zhang 2025-02-05 00:36:49 -08:00
  • 7ab84948d8 [ROCm] Logic to decide whether to used manually unrolled kernel. (#3306) Wen-Heng (Jack) Chung 2025-02-04 21:12:20 -06:00
  • 4885b90802 Use forward_cuda to execute custom op for hip platform (#3305) kk 2025-02-05 10:58:17 +08:00
  • c2723a42a5 [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. (#3299) Wen-Heng (Jack) Chung 2025-02-04 17:15:40 -06:00
  • c7256ca836 [ROCm] Add tuning configs for AMD Radeon Graphics. (#3294) Wen-Heng (Jack) Chung 2025-02-04 12:34:57 -06:00
  • 6186a8f889 update flashinfer install index url (#3293) Yineng Zhang 2025-02-05 00:44:35 +08:00
  • a07364ccc5 Update Triton decode backend interface (#3292) Ke Bao 2025-02-04 23:26:04 +08:00
  • 2c1a695ff1 ROCm: sgl-kernel enablement starting with sgl_moe_align_block (#3287) HAI 2025-02-04 05:44:44 -08:00
  • d39899e85c upgrade flashinfer v0.2.0.post2 (#3288) Yineng Zhang 2025-02-04 21:41:40 +08:00
  • 70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161) Baizhou Zhang 2025-02-03 22:09:13 -08:00
  • 7b5a374114 Update server args doc (#3273) simveit 2025-02-04 00:39:41 +01:00
  • 4b6f62e2bc add Atlas Cloud for Adoption and Sponsorship (#3276) Yineng Zhang 2025-02-04 05:31:30 +08:00
  • 897e2e253a add Nebius for Adoption and Sponsorship (#3274) Yineng Zhang 2025-02-04 04:41:26 +08:00
  • d54cee1441 adding Triton configs for DeepSeekV3 on Blackwell (#3272) kushanam 2025-02-03 12:12:09 -08:00
  • 00fa7d0417 add copyright for sgl-kernel (#3270) Yineng Zhang 2025-02-03 21:34:44 +08:00
  • 013021b6a1 refactor EAGLE 2 (#3269) Yineng Zhang 2025-02-03 20:52:30 +08:00
  • 3c8ac78dc1 optimize test_fused_moe style (#3268) Xiaoyu Zhang 2025-02-03 18:56:18 +08:00
  • 455bfe8dd3 Add a Doc about guide on nvidia jetson #3182 (#3205) Liangjun Song 2025-02-03 15:29:10 +11:00
  • 28b0a62bb3 Bug: Fix min_p sampling crash when using flashinfer backend (#3207) zifeitong 2025-02-02 15:36:07 -08:00
  • 566d61d90f ROCm: bump 6.3.0 (#3259) HAI 2025-02-02 12:13:40 -08:00
  • 55f5fc68ac Docs: Update accuracy evaluation (#3261) Chayenne 2025-02-02 11:14:59 -08:00
  • c27c378a19 docs/accuracy evaluation (#3114) simveit 2025-02-02 20:01:39 +01:00
  • d9eb9358cc Tune paged attention parameters for AMD GPU. (#3255) Wen-Heng (Jack) Chung 2025-02-01 19:29:45 -06:00
  • 959dca4fc7 use srt VocabParallelEmbedding (#3252) Yineng Zhang 2025-02-01 22:23:09 +08:00
  • f2b3a3188e Update README Yineng Zhang 2025-02-01 21:19:15 +08:00
  • ad6740977b add contact us in README (#3251) Yineng Zhang 2025-02-01 19:47:44 +08:00
  • 8db776f049 support QuickGELU (#3250) Yineng Zhang 2025-02-01 19:31:47 +08:00
  • 4eb4b401cc update and simplify CustomOp (#3249) Yineng Zhang 2025-02-01 18:56:44 +08:00
  • 17dbf976c5 update ENV to ROCm dockers (#3248) HAI 2025-02-01 01:27:43 -08:00
  • 5317902670 Add test for fp8 torch compile (#3246) Ke Bao 2025-02-01 16:07:54 +08:00
  • d7c0b32f4d [Docs] Add more details to profiling docs (#3221) Wenxuan Tan 2025-01-31 17:59:28 -06:00
  • 7b020cca2d add tuning block wise fp8 (#3242) Yineng Zhang 2025-02-01 03:58:18 +08:00
  • 7876279ea7 update cutlass dependency (#3240) Yineng Zhang 2025-02-01 03:13:44 +08:00
  • 34e405e01f update sgl-kernel version for sglang (#3238) Yineng Zhang 2025-02-01 02:14:41 +08:00
  • 1ebe1d6de5 Optimize MoE topk with torch compile (#3236) Ke Bao 2025-02-01 01:36:50 +08:00
  • 7811bfdaa7 compatible with flashinfer v0.2 (#3235) Yineng Zhang 2025-02-01 01:32:18 +08:00
  • 656f7fc1bc Docs: Quick fix for Speculative_decoding doc (#3228) Jhin 2025-01-31 10:30:40 -06:00
  • cf0f7eafe6 chore: bump v0.4.2.post1 (#3233) Yineng Zhang 2025-01-31 20:35:55 +08:00
  • b49d6d0fee support 12.5 CUDA runtime (#3231) Yineng Zhang 2025-01-31 20:31:38 +08:00
  • c02e313914 Fix block wise fp8 torch compile (#3232) Ke Bao 2025-01-31 19:56:02 +08:00
  • 734daedd8f [fix] Clamp logprob with dtype min to prevent -inf (#3224) Byron Hsu 2025-01-31 01:04:04 -08:00