### What this PR does / why we need it?
This PR adds a new document, `AGENTS.md`, which provides detailed
development guidelines for contributors to the vLLM Ascend project.
These guidelines cover code style, testing, NPU-specific considerations,
and the contribution process to ensure code quality and consistency.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update for developers.
### How was this patch tested?
This is a documentation change and does not require testing.
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
# vLLM Ascend Development Guidelines
This document provides instructions for contributors to the vLLM Ascend project. Please read and follow these guidelines to ensure code quality, maintainability, and consistency.
## Table of Contents
- Setup and Environment
- Testing
- Code Style
- NPU-Specific Considerations
- Model and Plugin Architecture
- Commit Messages and Pull Requests
- Review Checklist
- Quick Start for Contributors
- References
## Setup and Environment

### Environment Variables
All environment variables must be defined in `vllm_ascend/envs.py` using the centralized `env_variables` dictionary.

**Requirements:**
- Add documentation for each environment variable in the `env_variables` dict comment
- Specify default values and valid ranges
- Indicate whether the variable is sensitive (credentials, keys)
**Example:**

```python
env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
    # ...
}
```
**Never:** Hardcode environment variable names throughout the codebase. Reference them from the central module using `from vllm_ascend import envs`.
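As a minimal sketch of why the centralized pattern is useful, consider the following. The `get_env` helper is hypothetical, not the real `vllm_ascend` API: values are read through lambdas, so each access reflects the environment at call time rather than at import time.

```python
import os

# Hypothetical registry mirroring the envs.py pattern; each entry is a lambda
# so the value is resolved lazily on every access.
env_variables = {
    "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", "1")),
}

def get_env(name):
    """Resolve a registered environment variable by name (illustrative)."""
    return env_variables[name]()

# Changing the environment after import is still picked up:
os.environ["VLLM_ASCEND_ ENABLE_NZ".replace(" ", "")] = "0"
print(get_env("VLLM_ASCEND_ENABLE_NZ"))  # prints 0
```

Because every read goes through the registry, there is a single place to document defaults, valid ranges, and sensitivity.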
### Environment Variable Review Requirement

**Strict Review Required:** All new environment variables must undergo code review.

Reviewers must verify:
- The variable name follows the `VLLM_ASCEND_*` naming convention
- Default value is appropriate for all supported hardware
- Documentation is added to the `env_variables` dict
- The variable is used in a performance-critical path
## Testing

### Unit and System Tests

**Requirement:** All new functionality requires corresponding tests.

- **Unit Tests (UT):** Located in `tests/ut/`, cover core logic, edge cases, and error conditions
- **System Tests (ST):** Located in `tests/e2e/`, verify end-to-end behavior and integration points
- **Nightly Tests:** Include benchmarks for NPU-specific code paths in `tests/e2e/nightly/`
**Test Coverage Guidelines:**
- New features: Tests must cover happy path and failure modes
- Bug fixes: Tests must include a regression test for the bug
- Performance-critical code: Include benchmarks and performance regression tests
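A regression test for a bug fix might look like the following sketch. The `pad_token_ids` function and the described bug are illustrative, not real `vllm_ascend` code:

```python
# Hypothetical fixed function: the (invented) bug was that padding was
# prepended instead of appended, corrupting downstream attention masks.
def pad_token_ids(token_ids, pad_id, length):
    """Right-pad a token list with pad_id up to a fixed length."""
    return token_ids + [pad_id] * (length - len(token_ids))

def test_padding_is_right_aligned():
    # Regression test pinning the fixed behavior
    assert pad_token_ids([1, 2], 0, 4) == [1, 2, 0, 0]

def test_no_padding_needed():
    # Happy path: input already at target length is returned unchanged
    assert pad_token_ids([1, 2, 3], 0, 3) == [1, 2, 3]
```

The key point is that the regression test encodes the previously broken behavior explicitly, so the bug cannot silently reappear.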
### Running Tests

```bash
# Run specific unit test file
pytest -sv tests/ut/ops/test_prepare_finalize.py

# Run specific unit test
pytest -sv tests/ut/ops/test_prepare_finalize.py::test_prepare_inputs

# Run NPU-specific tests (requires NPU hardware)
pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency
```
**Requirement:** Run all tests locally before requesting review. Verify tests pass on NPU hardware for NPU-specific changes.
## Code Style

### Python Conventions
- **Imports:** All imports at the top of the file. Valid exceptions:
  - Circular imports (use inline imports)
  - Lazy loading for worker/isolation processes
  - Type-checking imports wrapped in `if TYPE_CHECKING:`
- **Global Variables:** Avoid new global variables. Pass dependencies explicitly through function parameters.
  - Allowed:
    - Constants named `ALL_UPPER_CASE` (e.g., `MAX_BATCH_SIZE` in `envs.py`)
    - Immutable configuration objects
  - Requires approval:
    - Any new mutable global state
- **No Magic Numbers:** Use named constants with descriptive names:

  ```python
  # Bad
  if seq_len > 2048: ...

  # Good
  MAX_CONTEXT_LENGTH = 2048
  if seq_len > MAX_CONTEXT_LENGTH: ...
  ```

- **Descriptive Naming:** Use names that describe functionality, not implementation details.

  ```python
  # Bad
  is_deepseek_v3_r1
  flag1
  tmp_var

  # Good
  supports_dynamic_temperature
  uses_speculative_decoding
  ```
### Naming Conventions

- **Classes:** `PascalCase` (e.g., `NPUModelRunner`, `AscendSampler`, `ACLGraphManager`)
- **Functions/Methods:** `snake_case` (e.g., `forward_pass`, `compute_attention`)
- **Constants:** `ALL_UPPER_CASE` (e.g., `MAX_BATCH_SIZE`, `VLLM_ASCEND_ENABLE_NZ`)
- **Variables:** `snake_case` (e.g., `token_ids`, `sequence_lengths`)
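Putting the conventions together, a sketch might look like this (the class and limit below are hypothetical, chosen only to illustrate the naming rules):

```python
MAX_BATCH_SIZE = 256  # constant: ALL_UPPER_CASE

class AscendBatchBuilder:  # class: PascalCase
    """Illustrative class; not part of the real codebase."""

    def add_sequence(self, token_ids):  # method and variables: snake_case
        if len(token_ids) > MAX_BATCH_SIZE:
            raise ValueError("sequence exceeds MAX_BATCH_SIZE")
        return list(token_ids)
```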
## NPU-Specific Considerations

### Tensor `item()` Operations

**Warning:** `tensor.item()` causes synchronization overhead on NPU when the tensor is on device.

If the tensor is a device tensor, calling `item()` triggers a synchronous data transfer from NPU to CPU, which can severely degrade performance in hot paths and causes the AsyncScheduler to block.
**Review Requirements:**
- Profile performance impact before merging
- Consider alternative patterns:
  - Keep values on device when possible
  - Batch operations to reduce sync frequency
  - Use device-side operations (e.g., `torch.argmax`, `torch.sum`)
- Document when `item()` is unavoidable (e.g., logging, conditional logic)
**Example Patterns:**

```python
# Bad: in a hot loop - causes one sync per iteration
for tensor in tensors:
    value = tensor.item()

# Better: batch operations - single sync
values = [t.item() for t in tensors]  # single batched sync

# Good: keep on device when possible
max_value = torch.max(tensor)  # no sync needed
if max_value > threshold:  # comparison can stay on device
    ...
```
### Memory and Performance

Additional NPU-specific best practices:
- Avoid CPU-NPU memory transfers in hot paths
- Prefer in-place operations where safe (e.g., `x.add_()`, `x.mul_()`)
- Monitor memory fragmentation, especially for long-running processes
- Test with realistic workloads on actual NPU hardware (Ascend 910B/C)
## Model and Plugin Architecture

### vLLM Ascend Plugin Architecture
vLLM Ascend is a hardware plugin that integrates with upstream vLLM via the pluggable hardware interface. It does not add new model files directly.
**Required Pattern:** Model-specific functionality should be implemented via:

- **Patching** (in `vllm_ascend/patch/`):
  - `vllm_ascend/patch/platform/` - platform-level patches (distributed, scheduling)
  - `vllm_ascend/patch/worker/` - worker-level patches (model-specific behavior)
  - Example: `patch_deepseek.py` modifies upstream Deepseek model behavior
  - Patching is not the best solution for all cases; use it only when necessary
- **Inheritance:**
  - `NPUModelRunner(GPUModelRunner)` - extend the vLLM model runner with NPU-specific behavior
  - `AscendSampler` - extend the vLLM sampler with NPU-specific operations
  - Add NPU-specific components via composition (e.g., `AclGraphManager`)
  - Custom operators - NPU-specific custom operators (e.g., `AscendRMSNorm`)
- **External upstream contributions** where appropriate
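The inheritance pattern above can be sketched as follows. `GPUModelRunner` here is a minimal stand-in, not the real vLLM class; the point is to override only the NPU-specific behavior and inherit everything else:

```python
class GPUModelRunner:
    """Stand-in for the upstream vLLM runner (illustrative only)."""

    def execute_model(self, batch):
        return f"executed {len(batch)} requests"

class NPUModelRunner(GPUModelRunner):
    """Extends the upstream runner with NPU-specific behavior."""

    def execute_model(self, batch):
        # NPU-specific preparation would go here (e.g., ACL graph capture),
        # then the inherited logic is reused rather than copied.
        return "npu: " + super().execute_model(batch)
```

Composition (e.g., holding an `AclGraphManager` as an attribute) follows the same principle: extend, do not fork.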
### Patching Requirement

**Strict Review Required:** All new patches must undergo thorough architectural review.
Reviewers must verify:
- The patch targets the correct upstream component
- The patch is minimal and focused
- Performance implications are understood
- A long-term plan exists for upstream contribution
**Example Patch Pattern:**

```python
# vllm_ascend/patch/worker/patch_deepseek.py
from vllm.model_executor.models.deepseek_v2 import DeepseekV2Model

def forward(self, input_ids, positions, ...):
    # NPU-specific forward implementation
    ...

DeepseekV2Model.forward = forward  # patch the upstream class
```
### Model Runner Changes

**Strict Review Required:** All new behaviors added to `model_runner` must undergo thorough architectural review.
Reviewers must verify:
- The necessity of the new behavior (why can't this be in a patch?)
- Performance implications on NPU hardware
- Compatibility with existing model implementations
- Long-term maintainability and test coverage
**NPU Model Runner Files:**
- `vllm_ascend/worker/model_runner_v1.py` - vLLM v1 model runner
- `vllm_ascend/worker/v2/model_runner.py` - vLLM v2 model runner
- `vllm_ascend/_310p/model_runner_310p.py` - Ascend 310P model runner
## Commit Messages and Pull Requests

### Commit Message Format

Follow the Conventional Commits format:

```
<type>: <summary>

<body - explaining what changed and why>
```

**Valid Types:** `feat`, `fix`, `perf`, `refactor`, `test`, `docs`, `chore`
**Good Examples:**

```
feat(npu): add flash attention support for Ascend CANN

- Implements FlashAttention-2 kernel for NPU backend
- Reduces memory usage by 30% compared to baseline
```

```
fix(model_runner): correct padding token handling

- Fixes token padding that caused incorrect attention masks
- Addresses issue #1234
```

```
perf: avoid CPU-NPU sync in attention computation

- Inline computation to avoid tensor.item() calls
- Improves throughput by 15%
```

**Bad Examples:**

```
fix bug
add feature
update code
```
## Review Checklist

Before merging, verify:

### Code Quality
- Code follows style guidelines (naming, imports, no magic numbers)
- No global state added without justification
- Patching pattern used correctly (if applicable)
- No direct model file additions
### Testing

- New tests added for new functionality (`tests/ut/` or `tests/e2e/`)
- Existing tests pass
- NPU-specific tests verified on actual hardware
- Performance benchmarks included where applicable
### Documentation
- Environment variables documented
- Public APIs documented
- User-facing changes reflected in docs
### NPU Considerations

- `tensor.item()` usage reviewed for performance impact
- No unnecessary CPU-NPU transfers in hot paths
- Memory usage verified on NPU hardware
### Commit and PR
- Commit messages are clear and descriptive
- PR description is complete
- All review comments addressed
## Quick Start for Contributors

- Install development dependencies: `pip install -e .[dev]`
- Run tests: `pytest tests/`
- Check linting: `ruff check vllm_ascend/`
- Format code: `ruff format vllm_ascend/`
- Make your changes following the guidelines in this document
- Add tests for new behavior
- Run the full test suite before committing
- Create a PR with a clear description