Sync from v0.13

2026-01-19 10:38:50 +08:00
parent b2ef04d792
commit 5aef6c175a
3714 changed files with 854317 additions and 89342 deletions
--- a/docs/contributing/ci/failures.md
+++ b/docs/contributing/ci/failures.md
@@ -0,0 +1,118 @@
+# CI Failures
+
+What should I do when a CI job fails on my PR, but I don't think my PR caused
+the failure?
+
+- Check the dashboard of current CI test failures:  
+  👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)
+
+- If your failure **is already listed**, it's likely unrelated to your PR.
+  Help fixing it is always welcome!
+    - Leave comments with links to additional instances of the failure.
+    - React with a 👍 to signal how many are affected.
+
+- If your failure **is not listed**, you should **file an issue**.
+
+## Filing a CI Test Failure Issue
+
+- **File a bug report:**  
+    👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)
+
+- **Use this title format:**
+
+    ```text
+    [CI Failure]: failing-test-job - regex/matching/failing:test
+    ```
+
+- **For the environment field:**
+
+    ```text
+    Still failing on main as of commit abcdef123
+    ```
+
+- **In the description, include failing tests:**
+
+    ```text
+    FAILED failing/test.py:failing_test1 - Failure description
+    FAILED failing/test.py:failing_test2 - Failure description
+    FAILED failing/test.py:failing_test3 - Failure description
+    ```
+
+- **Attach logs** (collapsible section example):
+    <details>
+    <summary>Logs:</summary>
+
+    ```text
+    ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data
+    --- Logging error ---  
+    Traceback (most recent call last):  
+      File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model  
+        return self.model_executor.execute_model(scheduler_output)
+    ...
+    FAILED failing/test.py:failing_test1 - Failure description
+    FAILED failing/test.py:failing_test2 - Failure description
+    FAILED failing/test.py:failing_test3 - Failure description
+    ```
+
+    </details>
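+
+The `FAILED` lines for the description can be collected from a downloaded log with a few lines of Python. This is a minimal sketch, assuming the log contains pytest short-summary lines in the format shown above; `extract_failed_lines` is a hypothetical helper and not part of the vLLM repository.
+
+```python
+import re
+
+# Matches pytest short-summary lines such as
+# "FAILED failing/test.py:failing_test1 - Failure description".
+FAILED_RE = re.compile(r"^FAILED\s+(\S+)(?:\s+-\s+(.*))?$")
+
+
+def extract_failed_lines(log_text: str) -> list[str]:
+    """Return unique FAILED summary lines in first-seen order."""
+    seen: dict[str, None] = {}
+    for raw in log_text.splitlines():
+        line = raw.strip()
+        if FAILED_RE.match(line):
+            # dict keys keep insertion order, so duplicates are dropped
+            # without reordering the remaining lines.
+            seen.setdefault(line, None)
+    return list(seen)
+```
+
+Deduplicating matters because the same summary line often appears more than once when a job is retried.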
+
+## Logs Wrangling
+
+Download the full log file from Buildkite locally.
+
+Strip timestamps and colorization:
+
+[.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
+
+```bash
+./ci-clean-log.sh ci.log
+```
+
+Use a tool such as [wl-clipboard](https://github.com/bugaevc/wl-clipboard) for quick copy-pasting:
+
+```bash
+# Copy the last 525 lines of the cleaned log to the clipboard
+tail -n 525 ci.log | wl-copy
+```
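+
+For readers who want to see what the cleaning step involves, the sketch below strips color codes and a leading timestamp from a single line. It is an illustration only and does not reproduce `ci-clean-log.sh`; the timestamp format (`[2025-05-20T03:26:38Z] `) is an assumption and may need adjusting for the logs you download.
+
+```python
+import re
+
+# ANSI CSI escape sequences (colors, cursor movement), e.g. "\x1b[31m".
+ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
+# Assumed timestamp prefix, e.g. "[2025-05-20T03:26:38Z] ".
+TIMESTAMP_RE = re.compile(r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z\]\s?")
+
+
+def clean_log_line(line: str) -> str:
+    """Remove color codes first, then a leading timestamp if present."""
+    return TIMESTAMP_RE.sub("", ANSI_RE.sub("", line))
+```
+
+Color codes are removed first so that a timestamp wrapped in a color sequence is still recognized at the start of the line.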
+
+## Investigating a CI Test Failure
+
+1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)
+2. Bisect to find the first build that shows the issue.  
+3. Add your findings to the GitHub issue.  
+4. If you find a strong candidate PR, mention it in the issue and ping contributors.
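+
+Step 2 amounts to a binary search over the build history. The following is a minimal sketch, assuming the failure is deterministic (every build from the culprit onward fails) and that `is_failing` is a hypothetical callback you supply, for example by inspecting each build's result by hand. For flaky tests this assumption does not hold, so treat the result as a candidate only.
+
+```python
+from typing import Callable, Optional, Sequence
+
+
+def first_failing_build(
+    builds: Sequence[int], is_failing: Callable[[int], bool]
+) -> Optional[int]:
+    """Binary-search `builds` (ordered oldest first) for the first failing one.
+
+    Returns None if no build in the sequence fails.
+    """
+    lo, hi = 0, len(builds)
+    while lo < hi:
+        mid = (lo + hi) // 2
+        if is_failing(builds[mid]):
+            hi = mid  # the first failure is at mid or earlier
+        else:
+            lo = mid + 1  # the first failure is after mid
+    return builds[lo] if lo < len(builds) else None
+```
+
+With 100 builds between the last known good and the first known bad one, this needs about seven checks instead of 100.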
+
+## Reproducing a Failure
+
+CI test failures may be flaky. Use a bash loop to run repeatedly:
+
+[.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
+
+```bash
+# Quote the test ID so the shell does not treat the square brackets as a glob
+./rerun-test.sh 'tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]'
+```
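+
+The same retry idea can be sketched in Python for cases where a shell loop is inconvenient. This is an illustration only and does not reproduce `rerun-test.sh`; `rerun_until_failure` and its `max_runs` parameter are hypothetical names.
+
+```python
+import subprocess
+import sys
+from typing import Optional
+
+
+def rerun_until_failure(cmd: list[str], max_runs: int = 100) -> Optional[int]:
+    """Run `cmd` repeatedly and return the 1-based attempt that first failed.
+
+    Returns None if every one of the `max_runs` attempts succeeded.
+    """
+    for attempt in range(1, max_runs + 1):
+        result = subprocess.run(cmd)
+        if result.returncode != 0:
+            return attempt
+    return None
+
+
+if __name__ == "__main__":
+    # Passing the test ID as a list element avoids any shell quoting issues.
+    test_id = "tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]"
+    failed_at = rerun_until_failure([sys.executable, "-m", "pytest", "-x", test_id])
+    print(f"first failure on attempt {failed_at}" if failed_at else "no failure observed")
+```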
+
+## Submitting a PR
+
+If you submit a PR to fix a CI failure:
+
+- Link the PR to the issue:
+  Add `Closes #12345` to the PR description.
+- Add the `ci-failure` label:
+  This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).
+
+## Other Resources
+
+- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability)
+- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main)
+
+## Daily Triage
+
+Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to:
+
+- Identify recent test failures **on `main`**.
+- Exclude legitimate test failures on PRs.
+- (Optional) Ignore tests with 0% reliability.
+
+Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20).
--- a/docs/contributing/ci/nightly_builds.md
+++ b/docs/contributing/ci/nightly_builds.md
@@ -0,0 +1,160 @@
+# Nightly Builds of vLLM Wheels
+
+vLLM maintains a per-commit wheel repository (commonly referred to as "nightly") at `https://wheels.vllm.ai` that provides pre-built wheels for every commit on the `main` branch since `v0.5.3`. This document explains how the nightly wheel index mechanism works.
+
+## Build and Upload Process on CI
+
+### Wheel Building
+
+Wheels are built in the `Release` pipeline (`.buildkite/release-pipeline.yaml`) after a PR is merged into the main branch, with multiple variants:
+
+- **Backend variants**: `cpu` and `cuXXX` (e.g., `cu129`, `cu130`).
+- **Architecture variants**: `x86_64` and `aarch64`.
+
+Each build step:
+
+1. Builds the wheel in a Docker container.
+2. Renames the wheel filename to use the correct manylinux tag (currently `manylinux_2_31`) for PEP 600 compliance.
+3. Uploads the wheel to S3 bucket `vllm-wheels` under `/{commit_hash}/`.
+
+### Index Generation
+
+After uploading each wheel, the `.buildkite/scripts/upload-wheels.sh` script:
+
+1. **Lists all existing wheels** in the commit directory from S3
+2. **Generates indices** using `.buildkite/scripts/generate-nightly-index.py`:
+    - Parses wheel filenames to extract metadata (version, variant, platform tags).
+    - Creates HTML index files (`index.html`) for PyPI compatibility.
+    - Generates machine-readable `metadata.json` files.
+3. **Uploads indices** to multiple locations (overwriting existing ones):
+    - `/{commit_hash}/` - Always uploaded for commit-specific access.
+    - `/nightly/` - Only for commits on `main` branch (not PRs).
+    - `/{version}/` - Only for release wheels (no `dev` in its version).
+
+!!! tip "Handling Concurrent Builds"
+    The index generation script can handle multiple variants being built concurrently by always listing all wheels in the commit directory before generating indices, avoiding race conditions.
+
+## Directory Structure
+
+The S3 bucket structure follows this pattern:
+
+```text
+s3://vllm-wheels/
+├── {commit_hash}/              # Commit-specific wheels and indices
+│   ├── vllm-*.whl              # All wheel files
+│   ├── index.html              # Project list (default variant)
+│   ├── vllm/
+│   │   ├── index.html          # Package index (default variant)
+│   │   └── metadata.json       # Metadata (default variant)
+│   ├── cu129/                  # Variant subdirectory
+│   │   ├── index.html          # Project list (cu129 variant)
+│   │   └── vllm/
+│   │       ├── index.html      # Package index (cu129 variant)
+│   │       └── metadata.json   # Metadata (cu129 variant)
+│   ├── cu130/                  # Variant subdirectory
+│   ├── cpu/                    # Variant subdirectory
+│   └── .../                    # More variant subdirectories
+├── nightly/                    # Latest main branch wheels (mirror of latest commit)
+└── {version}/                  # Release version indices (e.g., 0.11.2)
+```
+
+All built wheels are stored in `/{commit_hash}/`, while different indices are generated and reference them.
+This avoids duplication of wheel files.
+
+For example, you can specify the following URLs to use different indices:
+
+- `https://wheels.vllm.ai/nightly/cu130` for the latest main branch wheels built with CUDA 13.0.
+- `https://wheels.vllm.ai/{commit_hash}` for wheels built at a specific commit (default variant).
+- `https://wheels.vllm.ai/0.12.0/cpu` for 0.12.0 release wheels built for CPU variant.
+
+Please note that not all variants are present on every commit. The available variants are subject to change over time, e.g., changing cu130 to cu131.
+
+### Variant Organization
+
+Indices are organized by variant:
+
+- **Default variant**: Wheels without variant suffix (i.e., built with the current `VLLM_MAIN_CUDA_VERSION`) are placed in the root.
+- **Variant subdirectories**: Wheels with variant suffixes (e.g., `+cu130`, `.cpu`) are organized in subdirectories.
+- **Alias to default**: The default variant can have an alias (e.g., `cu129` for now) for consistency and convenience.
+
+The variant is extracted from the wheel filename (as described in the [file name convention](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-name-convention)):
+
+- The variant is encoded in the local version identifier (e.g. `+cu129` or `dev<N>+g<hash>.cu130`).
+- Examples:
+    - `vllm-0.11.2.dev278+gdbc3d9991-cp38-abi3-manylinux1_x86_64.whl` → default variant
+    - `vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl` → `cu129` variant
+    - `vllm-0.11.1rc8.dev14+gaa384b3c0.cu130-cp38-abi3-manylinux1_x86_64.whl` → `cu130` variant
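+
+The mapping from filename to variant in the examples above can be sketched as follows. This is a heuristic written for illustration, not the regex used by `generate-nightly-index.py`: it assumes that any local-version segment of the form `g<hex>` is a git hash and that the last remaining segment, if any, names the variant. `wheel_variant` is a hypothetical name.
+
+```python
+import re
+
+# Local-version segments that look like a git hash, e.g. "gdbc3d9991".
+GIT_HASH_RE = re.compile(r"^g[0-9a-f]+$")
+
+
+def wheel_variant(filename: str) -> str:
+    """Return the variant encoded in a wheel filename, or "default"."""
+    # Wheel names are "{name}-{version}(-{build})?-{python}-{abi}-{platform}.whl",
+    # so the version is always the second dash-separated field.
+    version = filename.split("-")[1]
+    # The local version identifier is everything after "+", split on ".".
+    _, _, local = version.partition("+")
+    candidates = [seg for seg in local.split(".") if seg and not GIT_HASH_RE.match(seg)]
+    return candidates[-1] if candidates else "default"
+
+
+for name in (
+    "vllm-0.11.2.dev278+gdbc3d9991-cp38-abi3-manylinux1_x86_64.whl",
+    "vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl",
+    "vllm-0.11.1rc8.dev14+gaa384b3c0.cu130-cp38-abi3-manylinux1_x86_64.whl",
+):
+    print(name, "->", wheel_variant(name))
+```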
+
+## Index Generation Details
+
+The `generate-nightly-index.py` script performs the following:
+
+1. **Parses wheel filenames** using regex to extract:
+    - Package name
+    - Version (with variant extracted)
+    - Python tag, ABI tag, platform tag
+    - Build tag (if present)
+2. **Groups wheels by variant**, then by package name:
+    - Currently only `vllm` is built, but the structure supports multiple packages in the future.
+3. **Generates HTML indices** (compliant with the [Simple repository API](https://packaging.python.org/en/latest/specifications/simple-repository-api/#simple-repository-api)):
+    - Top-level `index.html`: Lists all packages and variant subdirectories
+    - Package-level `index.html`: Lists all wheel files for that package
+    - Uses relative paths to wheel files for portability
+4. **Generates metadata.json**:
+    - Machine-readable JSON containing all wheel metadata
+    - Includes `path` field with URL-encoded relative path to wheel file
+    - Used by `setup.py` to locate compatible pre-compiled wheels during Python-only builds
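+
+A package-level index of the kind described in step 3 can be sketched as below. This is a simplified illustration, not the actual output of `generate-nightly-index.py`: `package_index_html` is a hypothetical name, and the `../` relative prefix assumes the index lives one level below the wheels, as in the default-variant layout shown earlier.
+
+```python
+from html import escape
+from urllib.parse import quote
+
+
+def package_index_html(wheel_names: list[str], wheel_dir: str = "../") -> str:
+    """Render a minimal Simple-repository-style page with one link per wheel."""
+    links = []
+    for name in sorted(wheel_names):
+        # quote() keeps "/" but percent-encodes "+", so "+cu129" becomes "%2Bcu129".
+        href = quote(wheel_dir + name)
+        links.append(f'    <a href="{href}">{escape(name)}</a><br/>')
+    body = "\n".join(links)
+    return f"<!DOCTYPE html>\n<html>\n  <body>\n{body}\n  </body>\n</html>\n"
+
+
+print(package_index_html(["vllm-0.10.2rc2+cu129-cp38-abi3-manylinux2014_aarch64.whl"]))
+```
+
+The link text keeps the readable filename, while the `href` carries the URL-encoded relative path.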
+
+### Special Handling for AWS Services
+
+The wheels and indices are directly stored on AWS S3, and we use AWS CloudFront as a CDN in front of the S3 bucket.
+
+Since S3 does not provide proper directory listing, to support PyPI-compatible simple repository API behavior, we deploy a CloudFront Function that:
+
+- redirects any URL that does not end with `/` and does not look like a file (i.e., does not contain a dot `.` in the last path segment) to the same URL with a trailing `/`
+- appends `index.html` to any URL that ends with `/`
+
+For example, the following requests would be handled as:
+
+- `/nightly` -> redirected to `/nightly/`, which is then served as `/nightly/index.html`
+- `/nightly/cu130/` -> `/nightly/cu130/index.html`
+- `/nightly/index.html` or `/nightly/vllm.whl` -> unchanged
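+
+The two rules can be modeled in a few lines. The deployed CloudFront Function is not shown in this document, so the Python below is only a model of the behavior described above; `rewrite_request` and its `(action, uri)` return shape are hypothetical.
+
+```python
+def rewrite_request(uri: str) -> tuple[str, str]:
+    """Return (action, uri), where action is "redirect" or "serve"."""
+    if uri.endswith("/"):
+        # Directory-style URL: serve the index file stored under that prefix.
+        return ("serve", uri + "index.html")
+    last_segment = uri.rsplit("/", 1)[-1]
+    if "." not in last_segment:
+        # No dot in the last segment: not a file, so add the trailing slash.
+        return ("redirect", uri + "/")
+    # Looks like a file (index.html, *.whl, metadata.json): leave it unchanged.
+    return ("serve", uri)
+
+
+for uri in ("/nightly", "/nightly/cu130/", "/nightly/index.html", "/nightly/vllm.whl"):
+    print(uri, "->", rewrite_request(uri))
+```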
+
+!!! note "AWS S3 Filename Escaping"
+
+    S3 will automatically escape filenames upon upload according to its [naming rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html). The direct impact on vLLM is that `+` in filenames will be converted to `%2B`. We take special care in the index generation script to escape filenames properly when generating the HTML indices and JSON metadata, to ensure the URLs are correct and can be directly used.
+
+## Usage of precompiled wheels in `setup.py` {#precompiled-wheels-usage}
+
+When installing vLLM with `VLLM_USE_PRECOMPILED=1`, the `setup.py` script:
+
+1. **Determines wheel location** via `precompiled_wheel_utils.determine_wheel_url()`:
+    - Env var `VLLM_PRECOMPILED_WHEEL_LOCATION` (user-specified URL/path) always takes precedence and skips all other steps.
+    - Determines the variant from `VLLM_MAIN_CUDA_VERSION` (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_VARIANT`); the default variant will also be tried as a fallback.
+    - Determines the _base commit_ (explained later) of this branch (can be overridden with env var `VLLM_PRECOMPILED_WHEEL_COMMIT`).
+2. **Fetches metadata** from `https://wheels.vllm.ai/{commit}/vllm/metadata.json` (for the default variant) or `https://wheels.vllm.ai/{commit}/{variant}/vllm/metadata.json` (for a specific variant).
+3. **Selects compatible wheel** based on:
+    - Package name (`vllm`)
+    - Platform tag (architecture match)
+4. **Downloads and extracts** precompiled binaries from the wheel:
+    - C++ extension modules (`.so` files)
+    - Flash Attention Python modules
+    - Triton kernel Python files
+5. **Patches package_data** to include extracted files in the installation
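+
+Steps 1 and 2 imply an ordered list of metadata URLs to try. The sketch below builds that list from the URL patterns given above; it is not the code in `setup.py`, and `metadata_urls` is a hypothetical name. The commit value in the usage line is a placeholder.
+
+```python
+from typing import Optional
+
+WHEELS_BASE_URL = "https://wheels.vllm.ai"
+
+
+def metadata_urls(commit: str, variant: Optional[str] = None) -> list[str]:
+    """List metadata.json URLs to try, most specific first."""
+    default_url = f"{WHEELS_BASE_URL}/{commit}/vllm/metadata.json"
+    if not variant:
+        return [default_url]
+    # Try the requested variant first, then fall back to the default variant.
+    return [f"{WHEELS_BASE_URL}/{commit}/{variant}/vllm/metadata.json", default_url]
+
+
+print(metadata_urls("<commit_hash>", "cu130"))
+```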
+
+!!! note "What is the base commit?"
+
+    The base commit is determined by finding the merge-base
+    between the current branch and upstream `main`, ensuring
+    compatibility between source code and precompiled binaries.
+
+_Note: it is the user's responsibility to ensure there are no native code changes (e.g., C++ or CUDA) before using precompiled wheels._
+
+## Implementation Files
+
+Key files involved in the nightly wheel mechanism:
+
+- **`.buildkite/release-pipeline.yaml`**: CI pipeline that builds wheels
+- **`.buildkite/scripts/upload-wheels.sh`**: Script that uploads wheels and generates indices
+- **`.buildkite/scripts/generate-nightly-index.py`**: Python script that generates PyPI-compatible indices
+- **`setup.py`**: Contains `precompiled_wheel_utils` class for fetching and using precompiled wheels
--- a/docs/contributing/ci/update_pytorch_version.md
+++ b/docs/contributing/ci/update_pytorch_version.md
@@ -0,0 +1,109 @@
+# Update PyTorch version on vLLM OSS CI/CD
+
+vLLM's current policy is to always use the latest PyTorch stable
+release in CI/CD. It is standard practice to submit a PR to update the
+PyTorch version as early as possible when a new [PyTorch stable
+release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
+This process is non-trivial because of the changes that accumulate between PyTorch
+releases. Using <https://github.com/vllm-project/vllm/pull/16859> as an example, this document outlines the common steps for this
+update, along with a list of potential issues and how to address them.
+
+## Test PyTorch release candidates (RCs)
+
+Updating PyTorch in vLLM after the official release is not
+ideal because any issues discovered at that point can only be resolved
+by waiting for the next release or by implementing hacky workarounds in vLLM.
+The better solution is to test vLLM with PyTorch release candidates (RC) to ensure
+compatibility before each release.
+
+PyTorch release candidates can be downloaded from [PyTorch test index](https://download.pytorch.org/whl/test).
+For example, the `torch` `2.7.0+cu128` RC can be installed using the following command:
+
+```bash
+uv pip install torch torchvision torchaudio \
+    --index-url https://download.pytorch.org/whl/test/cu128
+```
+
+When the final RC is ready for testing, it will be announced to the community
+on the [PyTorch dev-discuss forum](https://dev-discuss.pytorch.org/c/release-announcements).
+After this announcement, we can begin testing vLLM integration by drafting a pull request
+following this 3-step process:
+
+1. Update [requirements files](https://github.com/vllm-project/vllm/tree/main/requirements)
+to point to the new releases for `torch`, `torchvision`, and `torchaudio`.
+
+2. Use the following option to get the final release candidates' wheels. Some common platforms are `cpu`, `cu128`, and `rocm6.2.4`.
+
+    ```bash
+    --extra-index-url https://download.pytorch.org/whl/test/<PLATFORM>
+    ```
+
+3. Since vLLM uses `uv`, ensure the following index strategy is applied:
+
+    - Via environment variable:
+
+    ```bash
+    export UV_INDEX_STRATEGY=unsafe-best-match
+    ```
+
+    - Or via CLI flag:
+
+    ```bash
+    --index-strategy unsafe-best-match
+    ```
+
+If failures are found in the pull request, raise them as issues on vLLM and
+cc the PyTorch release team to initiate discussion on how to address them.
+
+## Update CUDA version
+
+The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, torch `2.7.1+cu126`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
+such as 12.8 for Blackwell support.
+This complicates the process as we cannot use the out-of-the-box
+`pip install torch torchvision torchaudio` command. The solution is to use
+`--extra-index-url` in vLLM's Dockerfiles.
+
+- Important indexes at the moment include:
+
+| Platform | `--extra-index-url` |
+|----------|-----------------|
+| CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
+| CPU      | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
+| ROCm 6.2 | [https://download.pytorch.org/whl/rocm6.2.4](https://download.pytorch.org/whl/rocm6.2.4) |
+| ROCm 6.3 | [https://download.pytorch.org/whl/rocm6.3](https://download.pytorch.org/whl/rocm6.3) |
+| XPU      | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
+
+- Update the files below to match the CUDA version from step 1. This ensures that the vLLM release wheel is tested on CI.
+    - `.buildkite/release-pipeline.yaml`
+    - `.buildkite/scripts/upload-wheels.sh`
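+
+The stable indexes in the table and the release-candidate index used earlier differ only in one path component. A minimal sketch that builds either URL, where `pytorch_index_url` is a hypothetical helper and the platform strings are the ones listed in this document:
+
+```python
+def pytorch_index_url(platform: str, release_candidate: bool = False) -> str:
+    """Build the PyTorch wheel index URL for a platform such as "cu128" or "cpu"."""
+    # Release candidates are published under whl/test, stable releases under whl.
+    channel = "whl/test" if release_candidate else "whl"
+    return f"https://download.pytorch.org/{channel}/{platform}"
+
+
+print(pytorch_index_url("cu128"))
+print(pytorch_index_url("cu128", release_candidate=True))
+```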
+
+## Address long vLLM build time
+
+When building vLLM with a new PyTorch/CUDA version, no cache will exist
+in the vLLM sccache S3 bucket, causing the build job on CI to potentially take more than 5 hours
+and time out. Additionally, since vLLM's fastcheck pipeline runs in read-only mode,
+it doesn't populate the cache, so re-running it to warm up the cache
+is ineffective.
+
+While ongoing efforts like <https://github.com/vllm-project/vllm/issues/17419>
+address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
+to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/long_build`)
+when manually triggering a build on Buildkite. This branch accomplishes two things:
+
+1. Increases the timeout limit to 10 hours so that the build doesn't time out.
+2. Allows the compiled artifacts to be written to the vLLM sccache S3 bucket
+to warm it up so that future builds are faster.
+
+<p align="center" width="100%">
+    <img width="60%" alt="Buildkite new build popup" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
+</p>
+
+## Update all the different vLLM platforms
+
+Rather than attempting to update all vLLM platforms in a single pull request, it's more manageable
+to handle some platforms separately. The separation of requirements and Dockerfiles
+for different platforms in vLLM CI/CD allows us to selectively choose
+which platforms to update. For instance, updating XPU requires the corresponding
+release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
+While <https://github.com/vllm-project/vllm/pull/16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
+<https://github.com/vllm-project/vllm/pull/17444> completed the update for XPU.