Commit Graph

5 Commits

Author SHA1 Message Date
starmountain1997
bc1622338c [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?

This version imposes no divisibility constraint between tp and mtp+1.
However, every value in cudagraph_capture_sizes must be a common multiple
of tp and mtp+1, and the largest value is tp * (mtp+1). We therefore set
cudagraph_capture_sizes to fixed values.
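
As a rough illustration of the rule above, the sketch below enumerates the candidate capture sizes for a given tp and mtp. The helper name `capture_sizes` and the sample values are hypothetical; this is not code from the PR or from vLLM, only one reading of the constraint as stated.

```python
from math import lcm  # available since Python 3.9


def capture_sizes(tp: int, mtp: int) -> list[int]:
    """List common multiples of tp and mtp+1, capped at tp * (mtp+1).

    Hypothetical helper: it only mirrors the constraint described in
    this commit message, not any vLLM API.
    """
    step = lcm(tp, mtp + 1)      # smallest common multiple
    upper = tp * (mtp + 1)       # stated maximum
    return list(range(step, upper + 1, step))


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    print(capture_sizes(tp=8, mtp=3))
    print(capture_sizes(tp=8, mtp=2))
```

When tp and mtp+1 share no common factor, the only admissible size under this reading is tp * (mtp+1) itself.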

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

We also add an aime2025 test to the dual-node DeepSeek-V3.2 nightly test.

### How was this patch tested?

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-26 10:58:50 +08:00
starmountain1997
6c73b88dd6 [CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?

This PR enables the FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing, in order to validate
PR #5702. The changes include:

1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to improve
   performance for distributed inference.
2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"].
3. Update baselines: Adjust performance baselines to reflect the
   improvements from FLASHCOMM1 and layer sharding.

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA attention W8A8 models on the
D (decode) instance in PD disaggregation, for example DeepSeekV3-W8A8 and
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA attention W8A8 models,
currently DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Users no longer need to set VLLM_ASCEND_ENABLE_MLAPO=1 manually to enable
the MLAPO feature for DeepSeek W8A8 models.

Effect of enabling MLAPO for the SFA model deployed on a single A3 node:

- Test: tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py
- Dataset: gsm8k-lite, without MTP, FULL GRAPH
- Result: about 19% higher output token throughput

| Metric | MLAPO not enabled by default | MLAPO enabled by default |
| --- | --- | --- |
| TTFT | 14055.8836 ms | 3753.1547 ms |
| ITL | 66.8171 ms | 61.4236 ms |
| Output Token Throughput | 104.9105 token/s | 125.2075 token/s |
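
The relative changes behind the figures reported above can be recomputed with a few lines of Python. This is only an arithmetic check on the numbers in this commit message; the function name `relative_change` is illustrative and not part of the test suite.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the measurements in this commit message.
metrics = {
    "TTFT (ms)": (14055.8836, 3753.1547),
    "ITL (ms)": (66.8171, 61.4236),
    "Output Token Throughput (token/s)": (104.9105, 125.2075),
}

if __name__ == "__main__":
    for name, (before, after) in metrics.items():
        print(f"{name}: {relative_change(before, after):+.1f}%")
```

Negative values for TTFT and ITL indicate lower latency, while the positive throughput value corresponds to the roughly 19% gain quoted in the message.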

- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
starmountain1997
0664c6e67a [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (#5921)
### What this PR does / why we need it?

#### Documentation Improvements

New Configuration: Added the layer_sharding parameter to the
DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include
`["q_b_proj", "o_proj"]` in their prefill node setup for better resource
utilization.
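
A minimal sketch of how such a setting might be serialized for a launch command follows. It assumes that layer_sharding is a top-level key of the JSON passed through vLLM's `--additional-config` option; the exact nesting and the rest of the startup command are not shown in this commit message and should be taken from the tutorial itself.

```python
import json
import shlex

# Assumption: layer_sharding sits at the top level of the additional config.
additional_config = {"layer_sharding": ["q_b_proj", "o_proj"]}

# Quote the JSON so it survives shell parsing as a single argument.
flag = "--additional-config " + shlex.quote(json.dumps(additional_config))

if __name__ == "__main__":
    print(flag)
```

Building the JSON with `json.dumps` rather than by hand avoids quoting mistakes when the value is pasted into a shell command.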

#### CI and Testing Updates

Test Config Update: Updated the multi-node E2E test configuration file
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml
to disable `FLASHCOMM`, enable `FULL_DECODE_ONLY`, and update the
performance baseline.

### Does this PR introduce any user-facing change?

Yes. The documentation now recommends a more optimized startup command
for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see
improved performance in multi-node PD disaggregation environments.

### How was this patch tested?
CI Validation: The updated E2E test configuration has been verified
through the nightly CI pipeline.

Environment: vLLM version v0.13.0

Base Commit: 11b6af5280

Hardware: Ascend A3/A2 multi-node cluster.

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-20 12:40:54 +08:00
starmountain1997
086c093347 [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#5371)
### What this PR does / why we need it?

Add DeepSeek-V3.2-W8A8 dual-node nightly CI test and update A3 nightly
test configuration:

1. Add DeepSeek-V3.2-W8A8 dual-node test:
   tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml
    - 2 nodes, 16 NPUs per node (32 NPUs total)
    - Configuration: 2P+1D (data-parallel-size=4, tensor-parallel-size=8,
      data-parallel-size-local=2)
    - Includes performance and accuracy benchmarks with GSM8K dataset
2. Update A3 nightly workflow: .github/workflows/nightly_test_a3.yaml
    - Added DeepSeek-V3.2-W8A8 dual-node test to the A3 nightly test matrix
    - Test name: multi-node-dpsk3.2-2node
3. Improve test scripts: Updated
   .github/workflows/_e2e_nightly_multi_node.yaml and related scripts for
   better multi-node testing support
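
The parallelism numbers listed above can be sanity-checked against the hardware layout. The sketch below assumes that each data-parallel rank owns one tensor-parallel group and that no other parallel dimension consumes devices; `check_layout` is a hypothetical helper, not part of the test scripts.

```python
def check_layout(nodes: int, npus_per_node: int,
                 dp: int, tp: int, dp_local: int) -> bool:
    """Check that DP/TP sizes are consistent with the device counts.

    Assumption: total devices = dp * tp, and each node hosts dp_local
    data-parallel ranks of tp devices each.
    """
    total_ok = dp * tp == nodes * npus_per_node
    per_node_ok = dp_local * tp == npus_per_node
    dp_split_ok = dp_local * nodes == dp
    return total_ok and per_node_ok and dp_split_ok


if __name__ == "__main__":
    # Values from the configuration described in this commit message.
    print(check_layout(nodes=2, npus_per_node=16, dp=4, tp=8, dp_local=2))
```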

Tested on A3 instances:
  - Performance baseline: 1 (threshold: 0.97)
  - Accuracy baseline: 95% (threshold: 5%)
  - Test dataset: GSM8K with 512 prompts for performance, gsm8k-lite for
    accuracy
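
One possible reading of these baselines and thresholds is sketched below. The commit message does not define how the thresholds are applied, so both rules are assumptions: performance passes when the measured value reaches at least threshold times the baseline, and accuracy passes when it is within the threshold (in percentage points) of the baseline. The function names are hypothetical.

```python
def performance_ok(measured: float, baseline: float = 1.0,
                   threshold: float = 0.97) -> bool:
    # Assumed rule: measured must reach at least threshold * baseline.
    return measured >= baseline * threshold


def accuracy_ok(measured_pct: float, baseline_pct: float = 95.0,
                threshold_pct: float = 5.0) -> bool:
    # Assumed rule: measured must be within threshold_pct points of baseline.
    return abs(measured_pct - baseline_pct) <= threshold_pct


if __name__ == "__main__":
    print(performance_ok(0.98), accuracy_ok(91.0))
```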
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-07 10:02:02 +08:00