[NVIDIA] Change to use num_local_experts (#8453)

Kaixi Hou
2025-07-28 10:38:19 -07:00
committed by GitHub
parent ccfe52a057
commit 134fa43e19
2 changed files with 3 additions and 2 deletions

@@ -214,7 +214,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--ep-size` | The expert parallelism size. | 1 |
 | `--enable-ep-moe` | Enabling expert parallelism for moe. The ep size is equal to the tp size. | False |
 | `--enable-deepep-moe` | Enabling DeepEP MoE implementation for EP MoE. | False |
-| `--enable-flashinfer-moe` | Enabling Flashinfer MoE implementation. | False |
+| `--enable-flashinfer-cutlass-moe` | Enabling Flashinfer Cutlass MoE implementation for high throughput. | False |
+| `--enable-flashinfer-trtllm-moe` | Enabling Flashinfer Trtllm MoE implementation for low latency. | False |
 | `--deepep-mode` | Select the mode when enable DeepEP MoE, could be `normal`, `low_latency` or `auto`. Default is `auto`, which means `low_latency` for decode batch and `normal` for prefill batch. | auto |
 | `--ep-num-redundant-experts` | Allocate this number of redundant experts in expert parallel. | 0 |
 | `--ep-dispatch-algorithm` | The algorithm to choose ranks for redundant experts in expert parallel. | None |
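
The `--deepep-mode` row describes a small dispatch rule: `auto` resolves to `low_latency` for decode batches and `normal` for prefill batches, while an explicit mode is used as given. A minimal Python sketch of that rule follows. The names `DeepEPMode` and `resolve_deepep_mode` are hypothetical, chosen for illustration only, and are not taken from the project's code.

```python
from enum import Enum


class DeepEPMode(Enum):
    """Hypothetical enum mirroring the three documented --deepep-mode values."""

    NORMAL = "normal"
    LOW_LATENCY = "low_latency"
    AUTO = "auto"


def resolve_deepep_mode(mode: DeepEPMode, is_decode_batch: bool) -> DeepEPMode:
    """Return the concrete mode for one batch (illustrative helper, not the real API).

    An explicit mode is returned unchanged. `auto` picks `low_latency`
    for decode batches and `normal` for prefill batches, as the table states.
    """
    if mode is not DeepEPMode.AUTO:
        return mode
    return DeepEPMode.LOW_LATENCY if is_decode_batch else DeepEPMode.NORMAL


if __name__ == "__main__":
    # Parse the flag value the way a CLI string would arrive.
    chosen = DeepEPMode("auto")
    print(resolve_deepep_mode(chosen, is_decode_batch=True).value)
    print(resolve_deepep_mode(chosen, is_decode_batch=False).value)
```

Under this sketch, only the `auto` setting depends on the batch type; `normal` and `low_latency` pass through untouched regardless of whether the batch is prefill or decode.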