[Docs] Modify ep related server args and remove cublas part of deepseek (#3732)
@@ -77,21 +77,3 @@ Overall, with these optimizations, we have achieved up to a 7x acceleration in o
- **Weight**: Per-128x128-block quantization for better numerical stability.
**Usage**: turned on by default for DeepSeek V3 models.
### Cublas Grouped Gemm
**Description**: The [Grouped Gemm API](https://docs.nvidia.com/cuda/cublas/index.html#cublasgemmgroupedbatchedex) provided by cuBLAS 12.5 is integrated into SGLang to accelerate settings where a group of matrix multiplications with different shapes needs to be executed. Typical examples are expert parallelism in MoE layers and LoRA modules when serving multiple LoRA adapters.
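As a minimal reference for what a grouped GEMM computes, here is a plain-Python sketch (illustrative only, not the cuBLAS kernel): each pair `(A_i, B_i)` in the group is multiplied independently, and shapes may differ across pairs, as they do across MoE experts.

```python
def grouped_gemm_ref(As, Bs):
    """Reference grouped GEMM: multiply each (A_i, B_i) pair independently.

    As[i] is an (m_i x k_i) matrix, Bs[i] is (k_i x n_i); shapes may differ
    across i. Returns the list of (m_i x n_i) products. This is the semantic
    the cuBLAS kernel accelerates, written out naively for clarity.
    """
    outs = []
    for A, B in zip(As, Bs):
        m, k = len(A), len(A[0])
        k2, n = len(B), len(B[0])
        assert k == k2, "inner dimensions must match within each pair"
        C = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
             for i in range(m)]
        outs.append(C)
    return outs
```

The kernel fuses these independent multiplications into one launch instead of looping over them on the host.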
**Usage**: SGLang currently only supports PyTorch 2.5, which is installed together with CUDA 12.4 packages. To use this kernel, users need a CUDA environment >= 12.5 and must manually upgrade the cuBLAS package as follows:
1. Make sure the system CUDA version is >= 12.5 with `nvcc -V`
2. Install SGLang following the [official documentation](https://docs.sglang.ai/start/install.html)
3. Upgrade cuBLAS to 12.5 with `pip install nvidia-cublas-cu12==12.5.3.2`
4. Compile the new sgl-kernel library with `cd sgl-kernel && make build`
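Step 1 above can be scripted rather than checked by eye. The sketch below is a hypothetical helper (not part of SGLang) that compares dotted version strings with `sort -V` (GNU coreutils) and parses the release number out of a captured `nvcc -V` banner:

```shell
# version_ge A B -- succeed when dotted version A >= B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: extract "12.5" from an nvcc -V banner and check it against 12.5.
# In practice the banner would come from: nvcc_banner=$(nvcc -V)
nvcc_banner="Cuda compilation tools, release 12.5, V12.5.82"
cuda_version=$(printf '%s' "$nvcc_banner" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
version_ge "$cuda_version" "12.5" && echo "CUDA toolkit is new enough"
```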
The cuBLAS grouped GEMM kernel can then be imported with:
```python
from sgl_kernel import cublas_grouped_gemm
```
Currently, cuBLAS only supports grouped GEMM for fp16/bf16/fp32 tensors, so fp8 tensors cannot be used.
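A caller might make this restriction explicit with a small guard before dispatching to the kernel; the sketch below is hypothetical (the names are not part of `sgl_kernel`):

```python
# Hypothetical guard reflecting the cuBLAS grouped GEMM dtype restriction:
# fp16/bf16/fp32 are accepted, fp8 variants must take another path.
CUBLAS_GROUPED_GEMM_DTYPES = {"float16", "bfloat16", "float32"}

def can_use_cublas_grouped_gemm(dtype_name: str) -> bool:
    """Return True if tensors of this dtype can go through the cuBLAS kernel."""
    return dtype_name in CUBLAS_GROUPED_GEMM_DTYPES
```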