Cherry-pick https://github.com/vllm-project/vllm-ascend/pull/8683
### What this PR does / why we need it?
This PR relaxes the TTFT threshold from `0.4` to `0.5` to improve
robustness under Data Parallel (DP) load imbalance.
#### Background
The current assertion enforces `prefix75 < prefix0 * 0.4`.
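
For illustration, here is a minimal sketch of that check. The function name `ttft_within_threshold` and its parameters are hypothetical and are not taken from the test suite. Only the inequality itself comes from the description above.

```python
def ttft_within_threshold(prefix0_ms: float,
                          prefix75_ms: float,
                          ratio: float = 0.4) -> bool:
    """Return True when the prefix75 TTFT is strictly below ratio * prefix0 TTFT.

    Hypothetical helper: it mirrors the inequality `prefix75 < prefix0 * ratio`
    and nothing else about the real test.
    """
    return prefix75_ms < prefix0_ms * ratio


# One passing and one failing nightly sample from the tables in this description.
print(ttft_within_threshold(4685.64, 1864.46))  # passing sample
print(ttft_within_threshold(4696.24, 1883.99))  # failing sample
```

Because the limit is a fixed fraction of `prefix0`, a shift of a few milliseconds in either measurement is enough to flip the result near the boundary.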
#### ❌ Nightly Failure Cases (Observed)
| prefix0 (ms) | threshold, 0.4x (ms) | prefix75 (ms) | delta (ms) |
|--------|------------------|----------|--------|
| 4696.24 | 1878.50 | 1883.99 | +5.49 |
| 4696.20 | 1878.48 | 1896.01 | +17.53 |
| 4636.73 | 1854.69 | 1902.48 | +47.79 |
| 4655.17 | 1862.07 | 1913.54 | +51.47 |
| 4685.35 | 1874.14 | 1919.36 | +45.22 |
| 4660.33 | 1864.13 | 1915.41 | +51.28 |
| 4648.30 | 1859.32 | 1950.50 | +91.18 |
| 4655.30 | 1862.12 | 1962.32 | +100.20 |
---
#### ✅ Nightly Passing Cases (Observed)
| prefix0 (ms) | threshold, 0.4x (ms) | prefix75 (ms) | margin (ms) |
|--------|------------------|----------|---------|
| 4685.64 | 1874.26 | 1864.46 | -9.80 |
| 5520.28 | 2208.11 | 1928.97 | -279.14 |
| 4639.23 | 1855.69 | 1846.86 | -8.83 |
| 4651.64 | 1860.66 | 1854.30 | -6.36 |
| 4640.39 | 1856.15 | 1840.32 | -15.83 |
| 4677.20 | 1870.88 | 1848.35 | -22.53 |
---
#### Key Observations
- Failures exceed the threshold by only **~5 ms to ~100 ms (~0.3%–5.4%)**
- Passing cases often have **very tight margins**: three of the six clear the threshold by less than 10 ms
- Passing and failing runs cluster on both sides of the threshold, with **no clear gap between them**
- Many failures are **borderline violations**, not real regressions
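
The overshoot figures above can be recomputed from the failure table. The sketch below does only that arithmetic. The `(prefix0, prefix75)` pairs are copied from the table, while the names `FAILING` and `overshoot` are illustrative and not part of the test code.

```python
# (prefix0, prefix75) TTFT pairs in ms, copied from the failure table above.
FAILING = [
    (4696.24, 1883.99),
    (4696.20, 1896.01),
    (4636.73, 1902.48),
    (4655.17, 1913.54),
    (4685.35, 1919.36),
    (4660.33, 1915.41),
    (4648.30, 1950.50),
    (4655.30, 1962.32),
]
THRESHOLD_RATIO = 0.4


def overshoot(prefix0, prefix75, ratio=THRESHOLD_RATIO):
    """Return (absolute ms, relative fraction) by which prefix75 exceeds the limit."""
    limit = prefix0 * ratio
    delta = prefix75 - limit
    return delta, delta / limit


overshoots = [overshoot(p0, p75) for p0, p75 in FAILING]
min_abs = min(d for d, _ in overshoots)
max_abs = max(d for d, _ in overshoots)
min_rel = min(r for _, r in overshoots)
max_rel = max(r for _, r in overshoots)

print(f"absolute overshoot: {min_abs:.2f} ms to {max_abs:.2f} ms")
print(f"relative overshoot: {min_rel:.2%} to {max_rel:.2%}")
```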
---
#### Root Cause
The instability is caused by **Data Parallel (DP) load imbalance**,
which introduces systematic variance:
- Uneven request distribution across workers
- Queueing delays
- Increased TTFT variance (especially for `prefix75`)
---
#### Conclusion
- The current threshold (`0.4x`) is **too strict**
- Observed natural fluctuation:
  - Absolute: up to ~100 ms
  - Relative: up to ~5% over threshold
- Pass/fail boundary is currently **too sensitive to runtime jitter**
---
#### Change
We relax the threshold: **0.4 → 0.5**
This adjustment:
- Accounts for expected runtime variance
- Reduces spurious test failures
- Maintains a meaningful performance constraint
Even with `0.5`, the requirement remains strict (`prefix75 < 50% of
prefix0`): the observed failures peak at a ratio of about 0.42, so a
regression that pushes `prefix75` to half of `prefix0` still fails.
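
As a sanity check on the new value, the sketch below re-evaluates every nightly sample listed in this description at both ratios. The data comes from the two tables. The names `SAMPLES` and `passes` are illustrative and are not the identifiers used in the test.

```python
# (prefix0, prefix75) TTFT pairs in ms: eight failing rows, then six passing rows.
SAMPLES = [
    (4696.24, 1883.99), (4696.20, 1896.01), (4636.73, 1902.48),
    (4655.17, 1913.54), (4685.35, 1919.36), (4660.33, 1915.41),
    (4648.30, 1950.50), (4655.30, 1962.32),
    (4685.64, 1864.46), (5520.28, 1928.97), (4639.23, 1846.86),
    (4651.64, 1854.30), (4640.39, 1840.32), (4677.20, 1848.35),
]


def passes(prefix0, prefix75, ratio):
    """Apply the assertion's inequality for a given ratio."""
    return prefix75 < prefix0 * ratio


ratios = [p75 / p0 for p0, p75 in SAMPLES]
pass_old = sum(passes(p0, p75, 0.4) for p0, p75 in SAMPLES)
pass_new = sum(passes(p0, p75, 0.5) for p0, p75 in SAMPLES)

print(f"observed prefix75/prefix0: {min(ratios):.3f} to {max(ratios):.3f}")
print(f"passing at 0.4: {pass_old}/{len(SAMPLES)}")
print(f"passing at 0.5: {pass_new}/{len(SAMPLES)}")
```

All listed samples pass at `0.5`, and the largest observed ratio still leaves headroom below the new limit.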
---
### Does this PR introduce _any_ user-facing change?
No.
This change only affects internal test assertions and does not impact
user-facing behavior or model performance.
---
### How was this patch tested?
- Verified against existing TTFT test cases:
  - Previously failing cases (due to small variance) now pass
  - No regressions observed in other scenarios
- Confirmed that failures were due to DP load imbalance rather than
actual performance degradation
- Ensured the updated threshold still enforces a meaningful constraint
on TTFT
Signed-off-by: underfituu <hzhucong@163.com>