diff --git a/docs/source/_templates/Model-Deployment-Tutorial-Template.md b/docs/source/_templates/Model-Deployment-Tutorial-Template.md new file mode 100644 index 00000000..bf6d1b36 --- /dev/null +++ b/docs/source/_templates/Model-Deployment-Tutorial-Template.md @@ -0,0 +1,233 @@ +# Deployment Tutorial Template Based on the XXX Model + +This template is based on deployment tutorials for models such as DeepSeek-V3.2 and Qwen-VL-Dense, and is intended to serve as a reference for technical documentation writing. Users can systematically construct relevant technical documentation by following the guidelines provided in this template. + +## 1 Introduction + +**Content Writing Requirements:** + +- Provide a one-sentence description of the model's basic architecture, core features, and primary application scenarios. +- Provide a one-sentence description of the document's purpose and the objectives to be achieved. +- Specify the version of vLLM-Ascend used in the document and the version support status of the model. + +**Example 1: Model Introduction** + +DeepSeek-V3.2 is a sparse attention model. Its core architecture is similar to that of DeepSeek-V3.1, but it employs a sparse attention mechanism, aiming to explore and validate optimization solutions for training and inference efficiency in long-context scenarios. + +**Example 2: Document Purpose** + +This document will demonstrate the primary validation steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, as well as accuracy and performance evaluation. + +**Example 3: Version Information** + +This document is validated and written based on **vLLM-Ascend v0.13.0**. The current model (XXX) is fully supported in this version, and all **v0.13.0 and later versions** can run stably. To use the latest features (e.g., PD separation, MTP), it is recommended to use v0.13.0 or a later version. 
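When stating the validated version as in Example 3, it helps to read the versions from the environment rather than from memory. A minimal sketch of such a check follows; the distribution names `vllm` and `vllm-ascend` are assumptions and should be adjusted to the actual installed package names:

```python
# Report the installed versions of the packages a tutorial claims to be validated against.
# The package names queried in __main__ are assumptions, not confirmed distribution names.
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version string, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["vllm", "vllm-ascend"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Recording the output of such a check alongside the tutorial makes the "validated on" claim reproducible.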
+ +## 2 Feature Matrix + +This section introduces the features supported by the model, including supported hardware, quantization methods, data parallelism, long-sequence features, etc. + +**Content Writing Requirements:** + +- Present the support status of models and features in a table format. +- Alternatively, provide references with hyperlinks. + +**Example 1: Feature Support List** + +| Model Name | Support Status | Remarks | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Caching | LoRA | Speculative Decoding | Asynchronous Scheduling | Tensor Parallelism | Pipeline Parallelism | Expert Parallelism | Data Parallelism | Prefill-Decode Separation | Segmented ACL Graph Execution | Full ACL Graph Execution | Max Model Length | MLP Weight Prefetch | Documentation | +| ------ | ---------- | ------ | ------ | ---------- | ------ | ------------ | -------------- | ------ | ---------- | ---------- | ---------- | ------------ | ---------- | ---------- | ------------------- | ----------- | ----------- | ------------- | ------------- | ---------- | +| DeepSeek V3/3.1 | ✅ | | ✅ | Atlas 800I A2:
Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k | | [DeepSeek-V3.1](../../tutorials/models/DeepSeek-V3.1.md) | +| DeepSeek V3.2 | ✅ | | ✅ | Atlas 800I A2:
Minimum card requirement: xx | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/models/DeepSeek-V3.2.md) | +| DeepSeek R1 | ✅ | | ✅ | Atlas 800I A2:
Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k | | [DeepSeek R1](../../tutorials/models/DeepSeek-R1.md) | +| Qwen3 | ✅ | | ✅ | Atlas 800I A2:
Minimum card requirement: xx | ✅ | ✅ | ✅ | | | ✅ | ✅ | | | ✅ | | ✅ | ✅ | 128k | ✅ | [Qwen3](../../tutorials/models/Qwen3-Dense.md) | + +**Note**: This is a simplified example. Please refer to the complete feature matrix for the full table. + +**Example 2: Reference Citation** + +Please refer to the [Supported Features List](../user_guide/support_matrix/supported_models.md) for the model support matrix. + +Please refer to the [Feature Guide](../user_guide/feature_guide/index.md) for feature configuration information. + +## 3 Environment Preparation + +### 3.1 Model Weight + +**Content Writing Requirements:** Describe the hardware resources, software environment, and model files required for deployment. + +**Example:** + +| Model Version | Hardware Requirements | Download Link | +| ---------- | ---------- | ---------- | +| DeepSeek-V3.2-Exp (BF16) | 2×Atlas 800 A3 (64G×16)
4×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16) | +| DeepSeek-V3.2-Exp-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)
2×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8) | +| DeepSeek-V3.2-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)
2×Atlas 800 A2 (64G×8) | [Model Weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/) | + +### 3.2 Verify Multi-node Communication (Optional) + +**Example:** + +If multi-node deployment is required, please follow the [Verify Multi-node Communication Environment](../installation.md#verify-multi-node-communication) guide for communication verification. + +## 4 Installation + +**Content Writing Requirements:** + +- Provide specific steps and startup commands, covering both single-node and multi-node configurations. +- Provide explanations for parameters, including meaning, value range, and units. +- Specify the basic environment variables and communication environment variables that need to be enabled, with explanations including meaning, value range, and units. + +### 4.1 Docker Image Installation + +**Example:** Omitted + +### 4.2 Source Code Installation + +**Example:** Omitted + +## 5 Online Service Deployment + +### 5.1 Single-Node Online Deployment + +**Content Writing Requirements:** + +- Describe the architectural characteristics and applicable scenarios of single-node deployment. +- Provide startup command templates and key parameter descriptions. +- Provide service verification methods. + +**Example:** + +Single-node deployment completes both Prefill and Decode within the same node, suitable for XXX scenarios. + +Startup Command: + +```bash +# Omitted +``` + +Service Verification: + +```bash +# Omitted +``` + +### 5.2 Multi-Node PD Separation Deployment + +**Content Writing Requirements:** + +- Describe the principles of PD separation architecture and applicable scenarios. +- List prerequisites (network, storage, permissions). +- Provide script frameworks and key configuration item descriptions. +- Specify node role division and startup procedures. +- Indicate performance metrics. 
+ +**Example:** Omitted + +### 5.3 Special Deployment Modes (Optional) + +**Content Writing Requirements:** + +- If the model features non‑standard deployment modes (e.g., offline batch processing for embedding models, low‑latency online serving for reranker models), the corresponding deployment solutions must be explicitly documented. +- Section 5 "Online Service Deployment" provides examples for single‑node online service deployment and multi‑node PD‑separated deployment, which can be referenced and extended. + +## 6 Functional Verification + +**Content Writing Requirements:** Guide users on how to test the basic functionality of the model through simple interface calls after the service is started. + +**Example:** + +After the service is started, the model can be invoked by sending a prompt: + +```shell +curl http://:/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek_v3.2", + "prompt": "The future of AI is", + "max_tokens": 50, + "temperature": 0 + }' +``` + +## 7 Accuracy Evaluation + +**Content Writing Requirements:** Introduce standardized methods and tools for evaluating model output quality (accuracy). Two accuracy evaluation methods are provided below as examples; alternatively, provide direct links to existing documentation. + +### Using AISBench + +For details, please refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md). + +### Using Language Model Evaluation Harness + +Using the `gsm8k` dataset as an example test dataset, run the accuracy evaluation for `DeepSeek-V3.2-W8A8` in online mode. + +1. For `lm_eval` installation, please refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md). +2. Run `lm_eval` to execute the accuracy evaluation. 
+
+```shell
+lm_eval \
+  --model local-completions \
+  --model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
+  --tasks gsm8k \
+  --output_path ./
+```
+
+## 8 Performance
+
+Omitted. Requirements are the same as for Accuracy Evaluation.
+
+## 9 Best Practices
+
+**Content Writing Requirements:**
+
+For each model, provide recommended configurations that achieve optimal performance in three scenarios (long sequence, low latency, high throughput), but do not provide specific performance data.
+
+## 10 Performance Tuning (Optional)
+
+**Content Writing Requirements:**
+
+- Summarize key optimization techniques and parameter tuning experiences for the model to help users achieve optimal performance in specific scenarios. Include optimization technique descriptions, enablement methods, parameter tuning recommendations, and typical configuration examples.
+- Hyperlinks to the feature guide may be used to allow users to view detailed descriptions of specific features.
+
+### 10.1 Key Optimization Points
+
+In this section, we introduce the key optimization points that can significantly improve the performance of the XX model. These techniques aim to improve throughput and efficiency in various scenarios.
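When applying the optimization points below, measuring every change under identical conditions keeps before/after comparisons honest. The sketch below times a single completion request against an OpenAI-compatible endpoint; the base URL and payload fields are illustrative assumptions mirroring the verification call in Section 6:

```python
# Time one /v1/completions request so individual optimizations can be compared
# under identical conditions. The endpoint and payload are illustrative assumptions.
import json
import time
import urllib.request


def time_completion(base_url: str, payload: dict) -> float:
    """Send one completion request and return the wall-clock latency in seconds."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return time.perf_counter() - start
```

A typical call, assuming the server from Section 5 is running locally, would be `time_completion("http://127.0.0.1:8000", {"model": "deepseek_v3.2", "prompt": "The future of AI is", "max_tokens": 50})`, repeated several times and averaged before and after toggling each optimization.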
+
+#### 10.1.1 Basic Optimizations
+
+**Example:**
+
+The following optimizations are enabled by default and require no additional configuration:
+
+| Optimization Technique | Technical Principle | Performance Benefit |
+| --------- | --------- | --------- |
+| Rope Optimization | The cos_sin_cache and indexing operations of positional encoding are executed only in the first layer, and subsequent layers reuse them directly | Reduces redundant computation during the decoding phase, accelerating inference |
+| AddRMSNormQuant Fusion | Fuses the residual add, RMS normalization, and quantization operations into a single operator | Optimizes memory access patterns, improving computational efficiency |
+| Zero-like Elimination | Removes unnecessary zero-tensor operations in the Attention forward pass | Reduces memory footprint, improves matrix operation efficiency |
+| FullGraph Optimization | Captures and replays the entire decoding graph at once using `compilation_config={"cudagraph_mode":"FULL_DECODE_ONLY"}` | Significantly reduces scheduling latency, stabilizes multi-device performance |
+
+#### 10.1.2 Advanced Optimizations (Require Explicit Enablement)
+
+**Example:**
+
+| Optimization Technique | Technical Principle | Enablement Method | Applicable Scenarios | Precautions |
+| --------- | --------- | --------- | --------- | --------- |
+| FlashComm_v1 | Decomposes traditional Allreduce into Reduce-Scatter and All-Gather, reducing RMSNorm computation dimensions | `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` | High-concurrency, Tensor Parallelism (TP) scenarios | Threshold protection: Only takes effect when the actual number of tokens exceeds the threshold to avoid performance degradation in low-concurrency scenarios |
+| Matmul-ReduceScatter Fusion | Fuses matrix multiplication and Reduce-Scatter operations to achieve pipelined parallel processing | Automatically enabled after enabling FlashComm_v1 | Large-scale distributed environments | Same as FlashComm_v1, has threshold protection |
+| Weight Prefetch | Utilizes vector computation time to prefetch MLP weights into L2 cache in advance | `export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` | MLP-intensive scenarios (Dense models) | Requires coordination with prefetch buffer size adjustment |
+| Asynchronous Scheduling | Non-blocking task scheduling to improve concurrent processing capability | `--async-scheduling` | Large-scale models, high-concurrency scenarios | Should be used in coordination with FullGraph optimization |
+
+### 10.2 Optimization Highlights
+
+**Content Writing Requirements:**
+
+Summarize the most noteworthy optimization points from the actual tuning process, distill the core lessons, and give readers tuning ideas for getting started quickly.
+
+**Example:**
+
+During the actual tuning process, the following points are most critical for performance improvement: the prefetch buffer size needs to be determined through empirical measurement to find the optimal overlap between computation and prefetching; the setting of `max-num-batched-tokens` needs to balance throughput against device memory to avoid excessive chunking or OOM risk; `cudagraph_capture_sizes` must be manually specified and cover the target concurrency, and when FlashComm_v1 is enabled these sizes must also be multiples of the TP size; `pa_shape_list` is a temporary tuning parameter that only takes effect for specific batch sizes, so track version evolution and adjust it in time. The coordinated configuration of the above parameters and environment variables is key to achieving peak performance.
+
+## 11 FAQ
+
+**Content Writing Requirements:**
+
+Provide solutions to common problems, including but not limited to a description of the problem symptom, cause analysis, and resolution steps.
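As an illustration of the symptom/cause/solution structure, startup failures caused by an already-bound port (a common phenomenon when redeploying the services from Section 5) can be triaged with a small check before restarting. The port number below is an assumption and should match the one passed to `vllm serve`:

```python
# FAQ diagnostic sketch: report whether something is already listening on the service
# port, the usual cause of "address already in use" startup failures.
# Port 8000 in __main__ is an assumption; use the port actually passed to `vllm serve`.
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds, i.e. a process is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use("127.0.0.1", 8000):
        print("Port 8000 is taken: stop the old server process before restarting.")
    else:
        print("Port 8000 is free: look elsewhere for the conflict (e.g. HCCL ports).")
```

A FAQ entry built this way pairs the observed error message (symptom) with the check above (cause analysis) and the cleanup-and-restart command (resolution).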
diff --git a/docs/source/community/contributors.md b/docs/source/community/contributors.md index fdafc516..f10ccb29 100644 --- a/docs/source/community/contributors.md +++ b/docs/source/community/contributors.md @@ -2,7 +2,7 @@ ## Committers -| Name | Github ID | Date | +| Name | GitHub ID | Date | |:-----------:|:-----:|:-----:| | Xiyuan Wang | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 | | Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 | diff --git a/docs/source/community/versioning_policy.md b/docs/source/community/versioning_policy.md index 6e92a837..4eb787fc 100644 --- a/docs/source/community/versioning_policy.md +++ b/docs/source/community/versioning_policy.md @@ -171,7 +171,7 @@ Notes: ## Software dependency management -- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu) +- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu) every 3 months, a development version (aka the POC version) every month, and a nightly version every day. - The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in + The PyPI stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version or branch. 
diff --git a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md index 00231a5f..c9be6fb3 100644 --- a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md +++ b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md @@ -10,13 +10,13 @@ Hence, KV Cache Pool is proposed to utilize various types of storage including H vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines. -While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware. +While one can utilize MooncakeStore in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports MooncakeStore and can utilize the data transfer strategy that best fits Huawei NPU hardware. -Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section). +Hence, we propose to integrate MooncakeStore with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section). ## Usage -vLLM Ascend currently supports Mooncake Store for KV Cache Pool. 
To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector. +vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable MooncakeStore, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector. For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html). @@ -51,7 +51,7 @@ Currently, we only perform put and get operations of KV Pool for **Prefill Nodes The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly. -To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order. +To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order. For details, please also refer to the Mooncake Connector Store Deployment Guide. @@ -86,6 +86,6 @@ The KV Connector methods that need to be implemented can be categorized into sch ## Limitations -1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool. +1. Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the storage for KV Cache pool. 2. 
For now, if we successfully looked up a key and found it exists, but failed to get it when calling KV Pool's get function, we just output a log indicating the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-compute everything assuming there's no prefix cache hit (or even better, revert only one block and keep using the Prefix Caches before that). diff --git a/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md b/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md index 4a51389c..83aac423 100644 --- a/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md +++ b/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md @@ -35,7 +35,7 @@ The workflow of obtaining inputs: At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model. -**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`. +**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`. ### 2. Build inputs attention metadata diff --git a/docs/source/developer_guide/Design_Documents/patch.md b/docs/source/developer_guide/Design_Documents/patch.md index d7934a83..76c67409 100644 --- a/docs/source/developer_guide/Design_Documents/patch.md +++ b/docs/source/developer_guide/Design_Documents/patch.md @@ -60,7 +60,7 @@ Before writing a patch, following the principle above, we should patch the least # 1. 
`` # Why: # - # How: + # How: # # Related PR (if no, explain why): # diff --git a/docs/source/developer_guide/Design_Documents/quantization.md b/docs/source/developer_guide/Design_Documents/quantization.md index 56d2e02d..6334dd27 100644 --- a/docs/source/developer_guide/Design_Documents/quantization.md +++ b/docs/source/developer_guide/Design_Documents/quantization.md @@ -54,7 +54,7 @@ Based on the above content, we present a brief description of the adaptation pro - **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class. ```python -from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme +from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme, AscendMoEScheme @register_scheme("W4A8_DYNAMIC", "linear") class AscendW4A8DynamicLinearMethod(AscendLinearScheme): diff --git a/docs/source/developer_guide/contribution/multi_node_test.md b/docs/source/developer_guide/contribution/multi_node_test.md index 9f807f5d..db2f81e2 100644 --- a/docs/source/developer_guide/contribution/multi_node_test.md +++ b/docs/source/developer_guide/contribution/multi_node_test.md @@ -55,14 +55,14 @@ From the workflow perspective, we can see how the final test script is executed, deployment: - envs: - # fill with envs like: : + # fill with envs like: : server_cmd: > - vllm serve ... + vllm serve ... - envs: - # fill with envs like: : + # fill with envs like: : server_cmd: > - vllm serve ... + vllm serve ... 
benchmarks: perf: # fill with performance test kwargs @@ -76,34 +76,34 @@ Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml] ```yaml multi-node-tests: - name: multi-node - if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') - strategy: + name: multi-node + if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') + strategy: fail-fast: false max-parallel: 1 matrix: - test_config: + test_config: - name: multi-node-deepseek-pd - config_file_path: DeepSeek-V3.yaml - size: 2 + config_file_path: DeepSeek-V3.yaml + size: 2 - name: multi-node-qwen3-dp - config_file_path: Qwen3-235B-A22B.yaml - size: 2 + config_file_path: Qwen3-235B-A22B.yaml + size: 2 - name: multi-node-qwenw8a8-2node - config_file_path: Qwen3-235B-W8A8.yaml - size: 2 + config_file_path: Qwen3-235B-W8A8.yaml + size: 2 - name: multi-node-qwenw8a8-2node-eplb - config_file_path: Qwen3-235B-W8A8-EPLB.yaml - size: 2 - uses: ./.github/workflows/_e2e_nightly_multi_node.yaml - with: + config_file_path: Qwen3-235B-W8A8-EPLB.yaml + size: 2 + uses: ./.github/workflows/_e2e_nightly_multi_node.yaml + with: soc_version: a3 runner: linux-aarch64-a3-0 image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3' replicas: 1 size: ${{ matrix.test_config.size }} config_file_path: ${{ matrix.test_config.config_file_path }} - secrets: + secrets: KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }} ``` @@ -179,14 +179,14 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/ name: dshm volumes: - name: dshm - emptyDir: + emptyDir: medium: Memory sizeLimit: 15Gi - name: shared-volume - persistentVolumeClaim: + persistentVolumeClaim: claimName: nv-action-vllm-benchmarks-v2 - name: driver-tools - hostPath: + hostPath: path: /usr/local/Ascend/driver/tools workerTemplate: spec: @@ -224,14 +224,14 @@ This section assumes that you already have a 
[Kubernetes](https://kubernetes.io/ name: dshm volumes: - name: dshm - emptyDir: + emptyDir: medium: Memory sizeLimit: 15Gi - name: shared-volume - persistentVolumeClaim: + persistentVolumeClaim: claimName: nv-action-vllm-benchmarks-v2 - name: driver-tools - hostPath: + hostPath: path: /usr/local/Ascend/driver/tools --- apiVersion: v1 diff --git a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md index 392e8ade..b43f0629 100644 --- a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md +++ b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md @@ -332,7 +332,6 @@ An L0 `dump.json` contains forward I/O for modules together with parameters. Usi "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt" } } - }, } } } @@ -389,7 +388,6 @@ An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` function "data_name": "Functional.relu.0.forward.output.0.pt" } ] - }, } } } diff --git a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md index 1d9394eb..6aad04db 100644 --- a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md +++ b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md @@ -111,7 +111,7 @@ sudo apt update sudo apt install libjemalloc2 # Configure jemalloc -export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD +export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD ``` #### 2.2. 
Tcmalloc diff --git a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md index 866c3794..373dc2ec 100644 --- a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md +++ b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md @@ -97,7 +97,8 @@ For local `dataset-path`, please set `hf-name` to its Hugging Face ID like First start serving your model: ```bash -VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B +export VLLM_USE_MODELSCOPE=True +vllm serve Qwen/Qwen3-8B ``` Then run the benchmarking script: @@ -158,7 +159,7 @@ vllm bench throughput \ If successful, you will see the following output ```shell -Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t +Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s Total num prompt tokens: 1280 Total num output tokens: 1280 diff --git a/docs/source/faqs.md b/docs/source/faqs.md index f9ad9652..79ba3ad3 100644 --- a/docs/source/faqs.md +++ b/docs/source/faqs.md @@ -259,7 +259,7 @@ The performance of `torch_npu.npu_fused_infer_attention_score` in small batch sc ```bash bash tools/install_flash_infer_attention_score_ops_a2.sh -## change to run the following instruction if you're using A3 machine +# change to run the following instruction if you're using A3 machine # bash tools/install_flash_infer_attention_score_ops_a3.sh ``` diff --git a/docs/source/installation.md b/docs/source/installation.md index 20a1dbe1..9502d839 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -128,7 +128,7 @@ sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq # Or using yum # yum update 
-y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
 
-# Config pip mirror
+# Configure the pip mirror. Only versions 0.11.0 and earlier are supported; if you are using a version later than 0.11.0, do not execute this command
 pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po
index 7c17b50f..d8a8ed8f 100644
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po
@@ -73,4 +73,4 @@ msgstr "目前,vllm-ascend 中结构化输出功能的使用方式与 vllm 中
 msgid ""
 "Find more examples and explanations about these usages in [vLLM official "
 "document](https://docs.vllm.ai/en/stable/features/structured_outputs/)."
-msgstr "更多关于这些用法的示例和解释,请参阅 [vLLM 官方文档](https://docs.vllm.ai/en/stable/features/structured_outputs/)。"
\ No newline at end of file
+msgstr "更多关于这些用法的示例和解释,请参阅 [vLLM 官方文档](https://docs.vllm.ai/en/stable/features/structured_outputs/)。"
diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
index e03bc35b..fa15b399 100644
--- a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
+++ b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md
@@ -327,8 +327,6 @@ The parameters are explained as follows:
 
 ## Accuracy Evaluation
 
-Here are two accuracy evaluation methods.
-
 ### Using AISBench
 
 1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md index 9e54e04b..05a66a8e 100644 --- a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md @@ -135,8 +135,6 @@ The parameters are explained as follows: ## Accuracy Evaluation -Here are two accuracy evaluation methods. - ### Using AISBench 1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details. diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md index 89eb5b43..6fa5d359 100644 --- a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md +++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md @@ -240,12 +240,12 @@ If you occasionally see `zmq.error.ZMQError: Address already in use` during star ### launch_online_dp.py Use `launch_online_dp.py` to launch external dp vllm servers. -[launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py) +[launch_online_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py) ### run_dp_template.sh Modify `run_dp_template.sh` on each node. 
-[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh) +[run_dp_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh) #### Layerwise diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md index 888a603e..e96bf733 100644 --- a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md +++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md @@ -1,10 +1,10 @@ # Prefill-Decode Disaggregation (Qwen2.5-VL) -## Getting Start +## Getting Started vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources. -Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1. +Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1. ## Verify Communication Environment diff --git a/docs/source/tutorials/features/suffix_speculative_decoding.md b/docs/source/tutorials/features/suffix_speculative_decoding.md index 44df7d14..646b510a 100644 --- a/docs/source/tutorials/features/suffix_speculative_decoding.md +++ b/docs/source/tutorials/features/suffix_speculative_decoding.md @@ -133,7 +133,7 @@ models = [ ```bash # Example command to test gsm8k dataset performance using the first 100 prompts. Commands for other datasets are similar. 
-ais_bench --models vllm_api_stream_chat \
+ais_bench --models vllm-api-stream-chat \
 --datasets gsm8k_gen_0_shot_cot_str_perf \
 --debug --summarizer default_perf --mode perf --num-prompts 100
 ```
diff --git a/docs/source/tutorials/models/Kimi-K2.5.md b/docs/source/tutorials/models/Kimi-K2.5.md
index f5bd5f98..6a87a40b 100644
--- a/docs/source/tutorials/models/Kimi-K2.5.md
+++ b/docs/source/tutorials/models/Kimi-K2.5.md
@@ -809,7 +809,7 @@ In this chapter, we recommend best practices for three scenarios:
 
 - **Q: Startup fails with HCCL port conflicts (address already bound). What should I do?**
 
-  A: Clean up old processes and restart: `pkill -f VLLM*`.
+  A: Clean up old processes and restart: `pkill -f vLLM*`.
 
 - **Q: How to handle OOM or unstable startup?**
diff --git a/docs/source/tutorials/models/Qwen2.5-Omni.md b/docs/source/tutorials/models/Qwen2.5-Omni.md
index b32c3ee8..03e50ce9 100644
--- a/docs/source/tutorials/models/Qwen2.5-Omni.md
+++ b/docs/source/tutorials/models/Qwen2.5-Omni.md
@@ -69,7 +69,7 @@ docker run --rm \
 #### Single NPU (Qwen2.5-Omni-7B)
 
 :::{note}
-The **environment variable** `LOCAL_MEDIA_PATH` which **allows** API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.
+The environment variable `LOCAL_MEDIA_PATH` allows API requests to read local images or videos from directories specified by the server file system. Note that this is a security risk; it should only be enabled in trusted environments.
 :::
 
@@ -128,7 +128,7 @@ Not supported yet.
## Functional Verification -If your service **starts** successfully, you can see the info shown below: +If your service starts successfully, you can see the info shown below: ```bash INFO: Started server process [2736] diff --git a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md index 7c00dd73..60d91094 100644 --- a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md +++ b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md @@ -16,7 +16,7 @@ Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user ### Model Weight -- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards(64G × 2).[Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking) +- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards (64G × 2).[Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking) It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/` ### Installation @@ -283,7 +283,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```bash -VLLM_USE_MODELSCOPE=True +export VLLM_USE_MODELSCOPE=True export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy diff --git a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md index 09f534a8..488bab81 100644 --- a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md +++ b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md @@ -87,7 +87,6 @@ If you want to deploy multi-node environment, you need to set up environment on ### Single-node Deployment -`Qwen3.5-397B-A17B` can be deployed on 2 Atlas 800 A3(64G*16) or 4 Atlas 800 A2(64G*8). 
`Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16) or 2 Atlas 800 A2(64G*8), need to start with parameter `--quantization ascend`. Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G*16). @@ -152,7 +151,7 @@ The parameters are explained as follows: ### Multi-node Deployment with MP (Recommended) -Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5-397B-A17B` model across multiple nodes. +Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5-397B-A17B-w8a8-mtp` model across multiple nodes. Node 0 @@ -277,247 +276,247 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl 1. Prefill Node 0 `run_p.sh` script -```shell -unset ftp_proxy -unset https_proxy -unset http_proxy -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxx" -local_ip="xxx" + ```shell + unset ftp_proxy + unset https_proxy + unset http_proxy + # this obtained through ifconfig + # nic_name is the network interface name corresponding to local_ip of the current node + nic_name="xxx" + local_ip="xxx" -# [Optional] jemalloc -# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. 
-# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD -export VLLM_ENGINE_READY_TIMEOUT_S=30000 -export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 -export IP_ADDRESS=$local_ip -export NETWORK_CARD_NAME=$nic_name -export HCCL_IF_IP=$IP_ADDRESS -export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME -export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME -export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME -export VLLM_USE_V1=1 -export HCCL_BUFFSIZE=1536 -export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH -export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" -export VLLM_TORCH_PROFILER_WITH_STACK=0 -export TASK_QUEUE_ENABLE=1 + # [Optional] jemalloc + # jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. + # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD + export VLLM_ENGINE_READY_TIMEOUT_S=30000 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 + export IP_ADDRESS=$local_ip + export NETWORK_CARD_NAME=$nic_name + export HCCL_IF_IP=$IP_ADDRESS + export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME + export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME + export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME + export VLLM_USE_V1=1 + export HCCL_BUFFSIZE=1536 + export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH + export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" + export VLLM_TORCH_PROFILER_WITH_STACK=0 + export TASK_QUEUE_ENABLE=1 -export VLLM_ASCEND_ENABLE_FUSED_MC2=1 -export HCCL_OP_EXPANSION_MODE="AIV" + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export HCCL_OP_EXPANSION_MODE="AIV" -export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ - --host ${IP_ADDRESS} \ - --port 30060 \ - --no-enable-prefix-caching \ - --enable-expert-parallel \ - --data-parallel-size 8 \ - --data-parallel-size-local 8 \ - --api-server-count 1 \ - --data-parallel-address 
${IP_ADDRESS} \ - --max-num_seqs 64 \ - --data-parallel-rpc-port 6884 \ - --tensor-parallel-size 2 \ - --seed 1024 \ - --distributed-executor-backend mp \ - --served-model-name qwen3.5 \ - --max-model-len 16384 \ - --max-num-batched-tokens 4096 \ - --trust-remote-code \ - --quantization ascend \ - --no-disable-hybrid-kv-cache-manager \ - --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ - --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ - --gpu-memory-utilization 0.9 \ - --enforce-eager \ - --kv-transfer-config \ - '{"kv_connector": "MooncakeLayerwiseConnector", - "kv_role": "kv_producer", - "kv_port": "23010", - "engine_id": "0", - "kv_connector_extra_config": { - "prefill": { - "dp_size": 8, - "tp_size": 2 - }, - "decode": { - "dp_size": 16, - "tp_size": 2 - } - } - }' -``` + export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 + vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ + --host ${IP_ADDRESS} \ + --port 30060 \ + --no-enable-prefix-caching \ + --enable-expert-parallel \ + --data-parallel-size 8 \ + --data-parallel-size-local 8 \ + --api-server-count 1 \ + --data-parallel-address ${IP_ADDRESS} \ + --max-num_seqs 64 \ + --data-parallel-rpc-port 6884 \ + --tensor-parallel-size 2 \ + --seed 1024 \ + --distributed-executor-backend mp \ + --served-model-name qwen3.5 \ + --max-model-len 16384 \ + --max-num-batched-tokens 4096 \ + --trust-remote-code \ + --quantization ascend \ + --no-disable-hybrid-kv-cache-manager \ + --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ + --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ + --gpu-memory-utilization 0.9 \ + --enforce-eager \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeLayerwiseConnector", + "kv_role": "kv_producer", + "kv_port": "23010", + "engine_id": "0", + "kv_connector_extra_config": { + "prefill": { + 
"dp_size": 8, + "tp_size": 2 + }, + "decode": { + "dp_size": 16, + "tp_size": 2 + } + } + }' + ``` -3. Decode Node 0 `run_d0.sh` script +2. Decode Node 0 `run_d0.sh` script -```shell -unset ftp_proxy -unset https_proxy -unset http_proxy -#!/bin/bash -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxx" -local_ip="xxx" -# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) -node0_ip="xxxx" + ```shell + unset ftp_proxy + unset https_proxy + unset http_proxy + #!/bin/bash + # this obtained through ifconfig + # nic_name is the network interface name corresponding to local_ip of the current node + nic_name="xxx" + local_ip="xxx" + # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) + node0_ip="xxxx" -export VLLM_ENGINE_READY_TIMEOUT_S=30000 -export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 -export MASTER_IP_ADDRESS=$node0_ip -export IP_ADDRESS=$local_ip + export VLLM_ENGINE_READY_TIMEOUT_S=30000 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 + export MASTER_IP_ADDRESS=$node0_ip + export IP_ADDRESS=$local_ip -export NETWORK_CARD_NAME=$nic_name + export NETWORK_CARD_NAME=$nic_name -export HCCL_IF_IP=$IP_ADDRESS -export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME -export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME -export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME + export HCCL_IF_IP=$IP_ADDRESS + export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME + export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME + export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME -export VLLM_USE_V1=1 -export HCCL_BUFFSIZE=1536 -export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH -export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" -export VLLM_TORCH_PROFILER_WITH_STACK=0 -export TASK_QUEUE_ENABLE=1 + export VLLM_USE_V1=1 + export HCCL_BUFFSIZE=1536 + export 
LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH + export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" + export VLLM_TORCH_PROFILER_WITH_STACK=0 + export TASK_QUEUE_ENABLE=1 -export VLLM_ASCEND_ENABLE_FUSED_MC2=1 -export HCCL_OP_EXPANSION_MODE="AIV" -export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 -vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ - --host ${IP_ADDRESS} \ - --port 30050 \ - --no-enable-prefix-caching \ - --enable-expert-parallel \ - --data-parallel-size 16 \ - --data-parallel-size-local 8 \ - --data-parallel-start-rank 0 \ - --api-server-count 1 \ - --data-parallel-address ${MASTER_IP_ADDRESS} \ - --max-num_seqs 32 \ - --data-parallel-rpc-port 6884 \ - --tensor-parallel-size 2 \ - --seed 1024 \ - --distributed-executor-backend mp \ - --served-model-name qwen3.5 \ - --max-model-len 16384 \ - --max-num-batched-tokens 128 \ - --trust-remote-code \ - --quantization ascend \ - --no-disable-hybrid-kv-cache-manager \ - --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ - --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ - --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ - --gpu-memory-utilization 0.96 \ - --kv-transfer-config \ - '{"kv_connector": "MooncakeLayerwiseConnector", - "kv_buffer_device": "npu", - "kv_role": "kv_consumer", - "kv_port": "36010", - "engine_id": "1", - "kv_connector_extra_config": { - "prefill": { - "dp_size": 8, - "tp_size": 2 - }, - "decode": { - "dp_size": 16, - "tp_size": 2 - } - } - }' -``` + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export HCCL_OP_EXPANSION_MODE="AIV" + export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 + vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ + --host ${IP_ADDRESS} \ + --port 30050 \ + --no-enable-prefix-caching \ + --enable-expert-parallel \ + --data-parallel-size 16 \ + --data-parallel-size-local 8 \ + 
--data-parallel-start-rank 0 \ + --api-server-count 1 \ + --data-parallel-address ${MASTER_IP_ADDRESS} \ + --max-num_seqs 32 \ + --data-parallel-rpc-port 6884 \ + --tensor-parallel-size 2 \ + --seed 1024 \ + --distributed-executor-backend mp \ + --served-model-name qwen3.5 \ + --max-model-len 16384 \ + --max-num-batched-tokens 128 \ + --trust-remote-code \ + --quantization ascend \ + --no-disable-hybrid-kv-cache-manager \ + --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ + --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ + --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ + --gpu-memory-utilization 0.96 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeLayerwiseConnector", + "kv_buffer_device": "npu", + "kv_role": "kv_consumer", + "kv_port": "36010", + "engine_id": "1", + "kv_connector_extra_config": { + "prefill": { + "dp_size": 8, + "tp_size": 2 + }, + "decode": { + "dp_size": 16, + "tp_size": 2 + } + } + }' + ``` -5. Decode Node 1 `run_d1.sh` script +3. 
Decode Node 1 `run_d1.sh` script -```shell -unset ftp_proxy -unset https_proxy -unset http_proxy -#!/bin/bash -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxx" -local_ip="xxx" -# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) -node0_ip="xxxx" + ```shell + unset ftp_proxy + unset https_proxy + unset http_proxy + #!/bin/bash + # this obtained through ifconfig + # nic_name is the network interface name corresponding to local_ip of the current node + nic_name="xxx" + local_ip="xxx" + # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) + node0_ip="xxxx" -export VLLM_ENGINE_READY_TIMEOUT_S=30000 -export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 -export MASTER_IP_ADDRESS=$node0_ip -export IP_ADDRESS=$local_ip + export VLLM_ENGINE_READY_TIMEOUT_S=30000 + export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=30000 + export MASTER_IP_ADDRESS=$node0_ip + export IP_ADDRESS=$local_ip -export NETWORK_CARD_NAME=$nic_name + export NETWORK_CARD_NAME=$nic_name -export HCCL_IF_IP=$IP_ADDRESS -export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME -export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME -export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME + export HCCL_IF_IP=$IP_ADDRESS + export GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME + export TP_SOCKET_IFNAME=$NETWORK_CARD_NAME + export HCCL_SOCKET_IFNAME=$NETWORK_CARD_NAME -export VLLM_USE_V1=1 -export HCCL_BUFFSIZE=1536 -export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH -export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" -export VLLM_TORCH_PROFILER_WITH_STACK=0 -export TASK_QUEUE_ENABLE=1 + export VLLM_USE_V1=1 + export HCCL_BUFFSIZE=1536 + export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH + export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" + export VLLM_TORCH_PROFILER_WITH_STACK=0 + export 
TASK_QUEUE_ENABLE=1 -export VLLM_ASCEND_ENABLE_FUSED_MC2=1 -export HCCL_OP_EXPANSION_MODE="AIV" -vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ - --host ${IP_ADDRESS} \ - --port 30050 \ - --headless \ - --no-enable-prefix-caching \ - --enable-expert-parallel \ - --data-parallel-size 16 \ - --data-parallel-size-local 8 \ - --data-parallel-start-rank 8 \ - --data-parallel-address ${MASTER_IP_ADDRESS} \ - --max-num_seqs 32 \ - --data-parallel-rpc-port 6884 \ - --tensor-parallel-size 2 \ - --seed 1024 \ - --distributed-executor-backend mp \ - --served-model-name qwen3.5 \ - --max-model-len 16384 \ - --max-num-batched-tokens 128 \ - --trust-remote-code \ - --quantization ascend \ - --no-disable-hybrid-kv-cache-manager \ - --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ - --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ - --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ - --gpu-memory-utilization 0.96 \ - --kv-transfer-config \ - '{"kv_connector": "MooncakeLayerwiseConnector", - "kv_buffer_device": "npu", - "kv_role": "kv_consumer", - "kv_port": "36010", - "engine_id": "2", - "kv_connector_extra_config": { - "prefill": { - "dp_size": 8, - "tp_size": 2 - }, - "decode": { - "dp_size": 16, - "tp_size": 2 - } - } - }' -``` + export VLLM_ASCEND_ENABLE_FUSED_MC2=1 + export HCCL_OP_EXPANSION_MODE="AIV" + vllm serve Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp \ + --host ${IP_ADDRESS} \ + --port 30050 \ + --headless \ + --no-enable-prefix-caching \ + --enable-expert-parallel \ + --data-parallel-size 16 \ + --data-parallel-size-local 8 \ + --data-parallel-start-rank 8 \ + --data-parallel-address ${MASTER_IP_ADDRESS} \ + --max-num_seqs 32 \ + --data-parallel-rpc-port 6884 \ + --tensor-parallel-size 2 \ + --seed 1024 \ + --distributed-executor-backend mp \ + --served-model-name qwen3.5 \ + --max-model-len 16384 \ + --max-num-batched-tokens 128 \ + --trust-remote-code \ + 
--quantization ascend \ + --no-disable-hybrid-kv-cache-manager \ + --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \ + --additional-config '{"recompute_scheduler_enable": true, "enable_cpu_binding": true}' \ + --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ + --gpu-memory-utilization 0.96 \ + --kv-transfer-config \ + '{"kv_connector": "MooncakeLayerwiseConnector", + "kv_buffer_device": "npu", + "kv_role": "kv_consumer", + "kv_port": "36010", + "engine_id": "2", + "kv_connector_extra_config": { + "prefill": { + "dp_size": 8, + "tp_size": 2 + }, + "decode": { + "dp_size": 16, + "tp_size": 2 + } + } + }' + ``` -**Notice:** -The parameters are explained as follows: + **Notice:** + The parameters are explained as follows: -- `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency. -- `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`. And the min is `n = 1` and the max is `n = max-num-seqs`. For other values, it is recommended to set them to the number of frequently occurring requests on the Decode (D) node. -- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously. -- `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944) + - `--async-scheduling`: enables the asynchronous scheduling function. 
When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency. + - `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`. And the min is `n = 1` and the max is `n = max-num-seqs`. For other values, it is recommended to set them to the number of frequently occurring requests on the Decode (D) node. + - `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously. + - `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944) -7. Run the `proxy.sh` script on the prefill master node +4. Run the `proxy.sh` script on the prefill master node Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py) diff --git a/docs/source/tutorials/models/Qwen3_reranker.md b/docs/source/tutorials/models/Qwen3_reranker.md index 2eef1ab1..2736171d 100644 --- a/docs/source/tutorials/models/Qwen3_reranker.md +++ b/docs/source/tutorials/models/Qwen3_reranker.md @@ -100,7 +100,7 @@ model_name = "Qwen/Qwen3-Reranker-8B" # It needs to computing 151669 tokens logits, making this method extremely # inefficient, not to mention incompatible with the vllm score API. 
# A method for converting the original model into a sequence classification -# model was proposed. See:https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3 +# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3 # Models converted offline using this method can not only be more efficient # and support the vllm score API, but also make the init parameters more # concise, for example. diff --git a/docs/source/user_guide/feature_guide/Fine_grained_TP.md b/docs/source/user_guide/feature_guide/Fine_grained_TP.md index d37425c5..0a1ca68a 100644 --- a/docs/source/user_guide/feature_guide/Fine_grained_TP.md +++ b/docs/source/user_guide/feature_guide/Fine_grained_TP.md @@ -1,4 +1,4 @@ -# Fine-Grained Tensor Parallelism (Finegrained TP) +# Fine-Grained Tensor Parallelism (Fine-grained TP) ## Overview @@ -8,12 +8,12 @@ This capability supports heterogeneous parallelism strategies within a single mo --- -## Benefits of Finegrained TP +## Benefits of Fine-grained TP Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding: - **Reduced Per-Device Memory Footprint**: - Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization. + Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization. - **Faster Memory Access in GEMMs**: In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj. 
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must: --- -## How to Use Finegrained TP +## How to Use Fine-grained TP ### Configuration Format diff --git a/docs/source/user_guide/feature_guide/cpu_binding.md b/docs/source/user_guide/feature_guide/cpu_binding.md index f2df9706..73ce9d2b 100644 --- a/docs/source/user_guide/feature_guide/cpu_binding.md +++ b/docs/source/user_guide/feature_guide/cpu_binding.md @@ -72,6 +72,7 @@ For best results, if you run inside a docker container, which `systemctl` is lik - **Stop `irqbalance` service**: For example, on Ubuntu system, you can run the following command to stop irqbalance: + ```bash sudo systemctl stop irqbalance ``` diff --git a/docs/source/user_guide/feature_guide/epd_disaggregation.md b/docs/source/user_guide/feature_guide/epd_disaggregation.md index 8659aef9..3b322792 100644 --- a/docs/source/user_guide/feature_guide/epd_disaggregation.md +++ b/docs/source/user_guide/feature_guide/epd_disaggregation.md @@ -56,12 +56,12 @@ All related code is under `vllm/distributed/ec_transfer`. * *Scheduler role* – checks cache existence and schedules loads. * *Worker role* – loads the embeddings into memory. -* **EPD Load Balance Proxy** - +* **EPD Load Balancing Proxy** - * *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path * *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances. We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. 
For step-by-step deployment and configuration of Mooncake, refer to the following guide: -[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html) +[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html) For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance,the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation. `docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill. diff --git a/docs/source/user_guide/feature_guide/external_dp.md b/docs/source/user_guide/feature_guide/external_dp.md index 72643d9d..d0aae24b 100644 --- a/docs/source/user_guide/feature_guide/external_dp.md +++ b/docs/source/user_guide/feature_guide/external_dp.md @@ -4,7 +4,7 @@ For larger-scale deployments especially, it can make sense to handle the orchest In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions. -## Getting Start +## Getting Started The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. 
In vllm-ascend we provide two enhanced functionalities: diff --git a/docs/source/user_guide/feature_guide/kv_pool.md b/docs/source/user_guide/feature_guide/kv_pool.md index 6239f758..d2f3f2a5 100644 --- a/docs/source/user_guide/feature_guide/kv_pool.md +++ b/docs/source/user_guide/feature_guide/kv_pool.md @@ -163,7 +163,7 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1 #A2 #export HCCL_INTRA_ROCE_ENABLE=1 -#Minimum retransmission timeout of the RDMA,equals 4.096 μs * 2 ^ timeout. +#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout. #Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retry for RDMA transfer. #HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully. export HCCL_RDMA_TIMEOUT=17 diff --git a/docs/source/user_guide/feature_guide/large_scale_ep.md b/docs/source/user_guide/feature_guide/large_scale_ep.md index 10af22c1..c3b1275c 100644 --- a/docs/source/user_guide/feature_guide/large_scale_ep.md +++ b/docs/source/user_guide/feature_guide/large_scale_ep.md @@ -1,6 +1,6 @@ # Distributed DP Server With Large-Scale Expert Parallelism -## Getting Start +## Getting Started vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of PD nodes, thereby enabling more flexible model deployment. \ Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy the model. Assume the IP of the servers starts from 192.0.0.1 and ends by 192.0.0.8. Use the first 4 servers as prefiller nodes and the last 4 servers as decoder nodes. And the prefiller nodes are deployed as master nodes independently, while the decoder nodes use the 192.0.0.5 node as the master node. 
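The `HCCL_RDMA_TIMEOUT` comment in the kv_pool changes above states that the minimum RDMA retransmission timeout equals `4.096 μs * 2 ^ timeout` and that `ASCEND_TRANSFER_TIMEOUT` must exceed 7 retries of that value. A minimal sketch of the arithmetic, using the `timeout=17` value shown in that example (the numbers are illustrative, not recommendations):

```shell
# Effective RDMA retransmission timeout = 4.096 us * 2 ^ HCCL_RDMA_TIMEOUT.
# 4.096 us = 4096 ns, so compute in nanoseconds with integer arithmetic.
timeout=17
total_ns=$(( 4096 * (1 << timeout) ))   # 4096 ns * 2^17
total_ms=$(( total_ns / 1000000 ))
echo "RDMA retransmission timeout: ${total_ns} ns (~${total_ms} ms)"
# ASCEND_TRANSFER_TIMEOUT must exceed 7 retries of this value:
echo "ASCEND_TRANSFER_TIMEOUT must be greater than $(( total_ms * 7 )) ms"
```

With `timeout=17` this works out to roughly half a second per retry, which is why `ASCEND_TRANSFER_TIMEOUT` needs headroom of several seconds.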
diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index a1986c66..b40054b9 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -68,7 +68,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th - [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. [#6806](https://github.com/vllm-project/vllm-ascend/pull/6806) - Qwen3-Next now supports FlashComm1. [#6830](https://github.com/vllm-project/vllm-ascend/pull/6830) - NPUWorker Profiler now supports profile_prefix for better profiling experience. [#6968](https://github.com/vllm-project/vllm-ascend/pull/6968) -- EPLB profiling now displays expert hotness comparison and time required for eplb adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001)] +- EPLB profiling now displays expert hotness comparison and time required for eplb adjustment. [#6877](https://github.com/vllm-project/vllm-ascend/pull/6877) [#7001](https://github.com/vllm-project/vllm-ascend/pull/7001) - Xlite Qwen3 MoE now supports Data Parallel. [#6715](https://github.com/vllm-project/vllm-ascend/pull/6715) - Mooncake Layerwise Connector now supports kv_pool. [#7032](https://github.com/vllm-project/vllm-ascend/pull/7032) - Eagle3 now supports QuaRot quantization without embedding. [#7038](https://github.com/vllm-project/vllm-ascend/pull/7038) @@ -126,7 +126,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th - Fix Qwen-Omni quantization bugs. [#7042](https://github.com/vllm-project/vllm-ascend/pull/7042) [#7007](https://github.com/vllm-project/vllm-ascend/pull/7007) - Fix GDN layer accuracy in graph mode. [#6822](https://github.com/vllm-project/vllm-ascend/pull/6822) - Fix precision bugs for PCP/DCP in PD disaggregate. 
[#6876](https://github.com/vllm-project/vllm-ascend/pull/6876) - - Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. [#6948](https://github.com/vllm-project/vllm-ascend/pull/6948) + - Fix MTP in PD disaggregation with full graph support for all D-Nodes. [#6948](https://github.com/vllm-project/vllm-ascend/pull/6948) - Fix GQA model error when enabling both DP and DCP. [#7012](https://github.com/vllm-project/vllm-ascend/pull/7012) - Fix MTP prefill misclassified as decode edge case. [#6835](https://github.com/vllm-project/vllm-ascend/pull/6835) - Fix Eagle3 acceptance rate for QuaRot quantized models. [#6914](https://github.com/vllm-project/vllm-ascend/pull/6914) @@ -137,13 +137,13 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th - Fix GDN layer detection for multimodal models. [#6941](https://github.com/vllm-project/vllm-ascend/pull/6941) - Fix 300I unquant model weight nd2nz error. [#6851](https://github.com/vllm-project/vllm-ascend/pull/6851) - Fix CPU binding logic. [#6889](https://github.com/vllm-project/vllm-ascend/pull/6889) - - Fix Eagle fullgraph shape capture. [#6846](https://github.com/vllm-project/vllm-ascend/pull/6846) + - Fix Eagle full graph shape capture. [#6846](https://github.com/vllm-project/vllm-ascend/pull/6846) ### Known Issue - Currently, for DeepSeek v3.2, PCP & DCP do not yet work with FlashComm1 feature, which may cause serve errors or other unknown errors. - In 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high concurrency scenario, e.g., 2K/2K tokens with 512 concurrent requests. -- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug need to fix in vLLM, now there is a workaround to avoid it: explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'` where `N = max_concurrency × (1 + num_speculative_tokens)`. 
+- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug that needs to be fixed in vLLM; as a workaround, explicitly set `--compilation-config '{"max_cudagraph_capture_size": N}'` where `N = max_concurrency * (1 + num_speculative_tokens)`. ## v0.15.0rc1 - 2026.02.27 @@ -219,7 +219,7 @@ This is the first release candidate of v0.15.0 for vLLM Ascend. Please follow th ### Others -- **MTP in PD Fullgraph**: Fixed support for ALL D-Nodes in fullgraph when running MTP in PD deployment. [#5472](https://github.com/vllm-project/vllm-ascend/pull/5472) +- **MTP in PD Full graph**: Fixed support for ALL D-Nodes in full graph when running MTP in PD deployment. [#5472](https://github.com/vllm-project/vllm-ascend/pull/5472) - **DeepSeekV3.1 Accuracy**: Fixed DeepSeekV3.1 accuracy issue. [#6805](https://github.com/vllm-project/vllm-ascend/pull/6805) - **EAGLE Refactor**: Routed MTP to EAGLE except for PCP/DCP+MTP cases. [#6349](https://github.com/vllm-project/vllm-ascend/pull/6349) - **Speculative Decoding Accuracy**: Fixed spec acceptance rate problem in vLLM 0.15.0. [#6606](https://github.com/vllm-project/vllm-ascend/pull/6606) @@ -336,7 +336,7 @@ Many custom ops and triton kernels were added in this release to speed up model - **ZMQ Bug Fix**: Fixed zmq send/receive failed bug. [#5503](https://github.com/vllm-project/vllm-ascend/pull/5503) - **Weight Transpose**: Fixed weight transpose in RL scenarios. [#5567](https://github.com/vllm-project/vllm-ascend/pull/5567) - **Eagle3 SP**: Adapted SP to eagle3. [#5562](https://github.com/vllm-project/vllm-ascend/pull/5562) -- **GLM4.6 MTP**: GLM4.6 now supports MTP with fullgraph. [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460) +- **GLM4.6 MTP**: GLM4.6 now supports MTP with full graph. [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460) - **Flashcomm2 Oshard**: Flashcomm2 now works with oshard generalized feature. 
[#4723](https://github.com/vllm-project/vllm-ascend/pull/4723) - **Fine-grained Shared Expert Overlap**: Support fine-grained shared expert overlap. [#5962](https://github.com/vllm-project/vllm-ascend/pull/5962) @@ -432,7 +432,7 @@ We mainly focus on quality and performance improvement in this release. The spec - Fix weight transpose in RL scenarios [#5567](https://github.com/vllm-project/vllm-ascend/pull/5567) - Adapted SP to eagle3 [#5562](https://github.com/vllm-project/vllm-ascend/pull/5562) - Context Parallel(PCP&DCP) support mlapo [#5672](https://github.com/vllm-project/vllm-ascend/pull/5672) -- GLM4.6 support mtp with fullgraph [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460) +- GLM4.6 supports MTP with full graph [#5460](https://github.com/vllm-project/vllm-ascend/pull/5460) - Flashcomm2 now works with oshard generalized feature [#4723](https://github.com/vllm-project/vllm-ascend/pull/4723) - Support setting tp=1 for the Eagle draft model [#5804](https://github.com/vllm-project/vllm-ascend/pull/5804) - Flashcomm1 feature now works with qwen3-vl [#5848](https://github.com/vllm-project/vllm-ascend/pull/5848) @@ -575,7 +575,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o - [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/KV_Cache_Pool_Guide.html) feature is added - [Experimental] A new graph mode `xlite` is introduced. It performs good with some models. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it. - LLMdatadist kv connector is removed. Please use mooncake connector instead. -- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}` doesn't work anymore. +- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}}` doesn't work anymore. - Torchair graph mode is removed. 
`--additional-config {"torchair_graph_config": {"enabled": true}}` doesn't work anymore. Please use aclgraph instead. - `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now. - speculative decode method `Ngram` is back now. @@ -585,7 +585,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o ### Upgrade Note - vLLM Ascend self maintained modeling file has been removed. The related python entrypoint is removed as well. So please uninstall the old version of vLLM Ascend in your env before upgrade. -- CANN is upgraded to 8.3.RC2, Pytorch and torch-npu are upgraded to 2.8.0. Don't forget to install them. +- CANN is upgraded to 8.3.RC2, PyTorch and torch-npu are upgraded to 2.8.0. Don't forget to install them. - Python 3.9 support is dropped to keep the same with vLLM v0.12.0 ### Known Issues @@ -683,7 +683,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release - For long sequence input case, there is no response sometimes and the kv cache usage is become higher. This is a bug for scheduler. We are working on it. - Qwen2-audio doesn't work by default, we're fixing it. Temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8. - When running Qwen3-Next with expert parallel enabled, please set `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024. -- The accuracy of DeepSeek3.2 with aclgraph is not correct. Temporary solution is to set `cudagraph_capture_sizes` to a suitable value depending on the batch size for the input. +- The accuracy of DeepSeek3.2 with aclgraph is not correct. A temporary workaround is to set `cudagraph_capture_sizes` to a suitable value depending on the input batch size. ## v0.11.0rc0 - 2025.09.30 @@ -699,7 +699,7 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow - DeepSeek works with aclgraph now. 
[#2707](https://github.com/vllm-project/vllm-ascend/pull/2707) - MTP works with aclgraph now. [#2932](https://github.com/vllm-project/vllm-ascend/pull/2932) - EPLB is supported now. [#2956](https://github.com/vllm-project/vllm-ascend/pull/2956) -- Mooncacke store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913) +- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913) - CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659) ### Others @@ -828,7 +828,7 @@ Please note that this release note will list all the important changes from last The following notes are especially for reference when upgrading from last final release (v0.7.3): - V0 Engine is not supported from this release. Please always set `VLLM_USE_V1=1` to use V1 engine with vLLM Ascend. -- Mindie Turbo is not needed with this release. And the old version of Mindie Turbo is not compatible. Please do not install it. Currently all the function and enhancement is included in vLLM Ascend already. We'll consider to add it back in the future in needed. +- Mindie Turbo is not needed with this release, and the old version of Mindie Turbo is not compatible. Please do not install it. All of its functions and enhancements are already included in vLLM Ascend. We'll consider adding it back in the future if needed. - Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them. ### Core @@ -893,7 +893,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the ### Core -- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN hase been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don’t forget to update them in your environment or using the latest images. 
+- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don’t forget to update them in your environment or use the latest images. - vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582) - Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 is supported in vLLM Ascend, please following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/docs/source/tutorials/multi_node_kimi.md) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162) - Pipeline Parallelism is supported in V1 now. [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800) @@ -1055,7 +1055,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [ ## v0.9.2rc1 - 2025.07.11 -This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be clean up in the future. +This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` anymore. This release is the last version to support the V0 engine; V0 code will be cleaned up in the future. ### Highlights @@ -1074,7 +1074,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [ ### Others - Official doc has been updated for better read experience. 
For example, more deployment tutorials are added, user/developer docs are updated. More guide will coming soon. -- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331) +- Fix accuracy problem for DeepSeek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331) - A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335) - A new env variable `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` has been added to improve the performance of topk-topp sampling. The default value is 0, we'll consider to enable it by default in the future[#1732](https://github.com/vllm-project/vllm-ascend/pull/1732) - A batch of bugs have been fixed for Data Parallelism case [#1273](https://github.com/vllm-project/vllm-ascend/pull/1273) [#1322](https://github.com/vllm-project/vllm-ascend/pull/1322) [#1275](https://github.com/vllm-project/vllm-ascend/pull/1275) [#1478](https://github.com/vllm-project/vllm-ascend/pull/1478) @@ -1233,7 +1233,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir ### Highlights -- This release includes all features landed in the previous release candidates ([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)). And all the features are fully tested and verified. Visit the official doc the get the detail [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) support matrix. 
+- This release includes all features landed in the previous release candidates ([v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2)). And all the features are fully tested and verified. Visit the official doc to get the detailed [feature](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/suppoted_features.html) and [model](https://docs.vllm.ai/projects/ascend/en/v0.7.3/user_guide/supported_models.html) support matrix. - Upgrade CANN to 8.1.RC1 to enable chunked prefill and automatic prefix caching features. You can now enable them now. - Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu now. Now users don't need to install the torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. [#662](https://github.com/vllm-project/vllm-ascend/pull/662) - Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. [#708](https://github.com/vllm-project/vllm-ascend/pull/708) @@ -1256,7 +1256,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir ## v0.8.5rc1 - 2025.05.06 -This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. Now you can enable V1 egnine by setting the environment variable `VLLM_USE_V1=1`, see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md). +This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. 
Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md). ### Highlights