Add Benchmark for DeepGEMM Group GEMM (#3993)

2025-03-02 17:47:21 -08:00
parent 9cf4077294
commit b7e274f2d9
3 changed files with 502 additions and 2 deletions
--- a/benchmark/kernels/deepseek/README.md
+++ b/benchmark/kernels/deepseek/README.md
@@ -1,6 +1,19 @@
 ## DeepSeek kernels benchmark

+
+### Prerequisites
+- Install [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) from source before running `benchmark_deepgemm_fp8_gemm.py` and `benchmark_deepgemm_fp8_group_gemm.py`.
+
+### Benchmark
 - `benchmark_deepgemm_fp8_gemm.py`
-    - You should install [DeepGemm](https://github.com/deepseek-ai/DeepGEMM) from source before run `benchmark_deepgemm_fp8_gemm.py`.
-    - You can use the `--run_correctness` parameter to verify all kernels results's correctness.
+    ```bash
+    python benchmark_deepgemm_fp8_gemm.py --run_correctness --tp_size 1
+    ```
+
+- `benchmark_deepgemm_fp8_group_gemm.py`
+    ```bash
+    python benchmark_deepgemm_fp8_group_gemm.py --run_correctness --tp_size 1
+    ```
+
+ - You can use the `--run_correctness` parameter to verify the correctness of all kernels' results.
    - You can use the `--tp_size` parameter to benchmark all FP8 w8a8 block-wise matrix multiplications involved in DeepSeek V3/R1 under the current tensor parallelism (TP) setting. This benchmark compares DeepSeek's open-source [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) implementation with SGLang's and vLLM's Triton implementations.
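
The `--tp_size` note above says the benchmarked matrix multiplications depend on the tensor parallelism setting. As a rough illustration of why, the sketch below shows how per-rank GEMM shapes could shrink as the TP size grows. It is not taken from the benchmark scripts: the helper `shard_shapes`, the split into N-sharded and K-sharded layers, and the example shapes are all assumptions made for illustration.

```python
from typing import List, Tuple

Shape = Tuple[int, int]  # (N, K): output features, reduction dimension


def shard_shapes(
    n_sharded: List[Shape], k_sharded: List[Shape], tp_size: int
) -> List[Shape]:
    """Return per-rank (N, K) GEMM shapes for a given tensor-parallel size.

    Hypothetical helper: it only models the usual column-parallel (split N)
    and row-parallel (split K) sharding, not any specific script's logic.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be a positive integer")
    shapes: List[Shape] = []
    for n, k in n_sharded:
        # Column-parallel layer: the output dimension N is split across ranks.
        if n % tp_size:
            raise ValueError(f"N={n} is not divisible by tp_size={tp_size}")
        shapes.append((n // tp_size, k))
    for n, k in k_sharded:
        # Row-parallel layer: the reduction dimension K is split across ranks.
        if k % tp_size:
            raise ValueError(f"K={k} is not divisible by tp_size={tp_size}")
        shapes.append((n, k // tp_size))
    return shapes


# Illustrative shapes only; the real benchmark defines its own list.
N_SHARDED = [(4096, 7168)]
K_SHARDED = [(7168, 2048)]

for tp in (1, 8):
    print(tp, shard_shapes(N_SHARDED, K_SHARDED, tp_size=tp))
```

Under these assumptions, a larger `--tp_size` means each rank runs smaller GEMMs, which is why kernel comparisons are worth repeating for each TP setting you plan to deploy.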