## What this PR does / why we need it?
This PR updates the DeepSeek-V3.2 documentation to include the latest
performance optimizations and configuration improvements.
### Changes
- **Enable FlashComm1**: Added the `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` environment variable across all deployment scenarios to enable FlashComm1 for improved communication performance (the settings in this list are combined into a single launch sketch below)
- **Layer Sharding**: Added `--additional-config '{"layer_sharding":
["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for
better memory distribution
- **CUDA Graph Optimization**: Updated cudagraph capture sizes from
`[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40,
48]`
- **Speculative Decoding**: Increased `num_speculative_tokens` from 2 to
3
- **Documentation Links**: Fixed request forwarding documentation to use
proper GitHub repository links
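
As a minimal sketch, the settings above might be combined into one serving command as follows. Only the environment variable, the `--additional-config` JSON, the capture-size list, and `num_speculative_tokens` come from this PR; the model path and the `--compilation-config`/`--speculative-config` flag spellings are assumptions based on vLLM's generic CLI, so consult the updated DeepSeek-V3.2 deployment docs for the exact invocation.

```bash
# Hypothetical launch sketch; the model path and flag layout are assumptions,
# not taken verbatim from the updated docs.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1  # enable FlashComm1

vllm serve /path/to/DeepSeek-V3.2 \
  --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
  --compilation-config '{"cudagraph_capture_sizes": [8, 16, 24, 32, 40, 48]}' \
  --speculative-config '{"num_speculative_tokens": 3}'  # other spec-decode keys (e.g. method) elided
```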
## Does this PR introduce _any_ user-facing change?
Yes, users can now follow the updated documentation to enable FlashComm1
and layer sharding for improved DeepSeek-V3.2 performance.
## How was this patch tested?
Existing documentation examples have been validated to ensure
configuration consistency across all deployment scenarios.
---
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
## vLLM Ascend Plugin documents

Live doc: https://docs.vllm.ai/projects/ascend

### Build the docs
```bash
# Install dependencies.
pip install -r requirements-docs.txt

# Build the docs.
make clean
make html

# Build the docs with translation
make intl

# Open the docs with your browser
python -m http.server -d _build/html/
```
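
Note that the `make intl` step builds the translated pages, which is presumably what backs the Chinese URL below.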
Launch your browser and open:
- English version: http://localhost:8000
- Chinese version: http://localhost:8000/zh_CN