diff --git a/AGENTS.md b/AGENTS.md index d4ee37fb..fae6f26b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -99,7 +99,7 @@ pytest -sv tests/ut/ops/test_prepare_finalize.py pytest -sv tests/ut/ops/test_prepare_finalize.py::test_prepare_inputs # Run NPU-specific tests (requires NPU hardware) -pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency +pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency.py ``` **Requirement**: Run all tests locally before requesting review. Verify tests pass on NPU hardware for NPU-specific changes. @@ -163,7 +163,7 @@ pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency **Warning**: `tensor.item()` operations cause synchronization overhead on NPU when the `tensor` is on device. -If the `tensor` is a device tensor, the operator `item()` will trigger a synchronous data transfer from NPU to CPU, which can severely degrade performance in hot paths, cause this will make `AsyncScheduler` block here. +If the `tensor` is a device tensor, calling `item()` will trigger a synchronous data transfer from NPU to CPU. This can severely degrade performance in hot paths, causing `AsyncScheduler` to block here. **Review Requirements:** diff --git a/CLAUDE.md b/CLAUDE.md index 302cdeab..6758dbd1 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1 +1 @@ -IMPORTANT: Ensure you’ve thoroughly reviewed the [AGENTS.md](AGENTS.md) file before beginning any work. +IMPORTANT: Ensure you've thoroughly reviewed the [AGENTS.md](AGENTS.md) file before beginning any work. diff --git a/README.md b/README.md index fce5d824..d9151afb 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ vLLM Ascend Plugin - [2025/12] We released the new official version [v0.11.0](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.11.0)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.11.0/) to start using vLLM Ascend Plugin on Ascend. - [2025/09] We released the new official version [v0.9.1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.1)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) to start deploying large-scale Expert Parallelism (EP) on Ascend. - [2025/08] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q) with vLLM and Tencent! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF). -- [2025/06] [User stories](https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend‌ assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios. +- [2025/06] [User stories](https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios. - [2025/06] [Contributors](https://docs.vllm.ai/projects/ascend/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded, thanks for all contributors. - [2025/05] We've released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! 
We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html). - [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF). diff --git a/README.zh.md b/README.zh.md index 4ce3e127..292ce6b0 100644 --- a/README.zh.md +++ b/README.zh.md @@ -62,7 +62,7 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP ## 贡献 -请参考 [CONTRIBUTING]((https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html)) 文档了解更多关于开发环境搭建、功能测试以及 PR 提交规范的信息。 +请参考[CONTRIBUTING](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html)文档了解更多关于开发环境搭建、功能测试以及 PR 提交规范的信息。 我们欢迎并重视任何形式的贡献与合作: @@ -74,7 +74,7 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP vllm-ascend有主干分支和开发分支。 - **main**: 主干分支,与vLLM的主干分支对应,并通过昇腾CI持续进行质量看护。 -- **releases/vX.Y.Z**: 开发分支,随vLLM部分新版本发布而创建,比如`releases/v0.13.0`是vllm-asend针对vLLM `v0.13.0` 版本的开发分支。 +- **releases/vX.Y.Z**: 开发分支,随vLLM部分新版本发布而创建,比如`releases/v0.13.0`是vllm-ascend针对vLLM `v0.13.0` 版本的开发分支。 下面是维护中的分支: diff --git a/benchmarks/README.md b/benchmarks/README.md index 2a2fe8ec..ad573fbb 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -46,7 +46,7 @@ Before running the benchmarks, ensure the following: ``` - For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`, It will construct random weights based on the passed model without downloading the weights from internet, which can greatly reduce the benchmark time. -- If you want to run benchmark customized, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests), let's take `Qwen2.5-VL-7B-Instruct`as an example: +- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests), let's take `Qwen2.5-VL-7B-Instruct`as an example: ```json [ diff --git a/docs/source/community/versioning_policy.md b/docs/source/community/versioning_policy.md index 4eb787fc..690656d3 100644 --- a/docs/source/community/versioning_policy.md +++ b/docs/source/community/versioning_policy.md @@ -171,7 +171,7 @@ Notes: ## Software dependency management -- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu) +- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu) every 3 months, a development version (aka the POC version) every month, and a nightly version every day. - The PyPI stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in - vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version or branch. + The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in + vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch. 
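To make the torch-npu dependency policy above easier to act on, here is a minimal sketch (not part of the policy itself) for checking which torch-npu build is installed before relying on it; the ".dev" suffix convention used to spot development/nightly builds is an assumption, so confirm against the release notes for your build.

```python
# Hedged sketch: report the installed torch-npu build so a PyPI stable release
# can be told apart from a dev/POC or nightly build (suffix convention assumed).
from importlib.metadata import PackageNotFoundError, version

try:
    ver = version("torch-npu")
except PackageNotFoundError:
    ver = None

if ver is None:
    print("torch-npu is not installed")
elif ".dev" in ver:
    print(f"development/nightly build detected: {ver}")
else:
    print(f"stable or RC build detected: {ver}")
```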
diff --git a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md index c9be6fb3..45e292cb 100644 --- a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md +++ b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md @@ -26,7 +26,7 @@ The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) throug Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth. -When combined with vLLM’s Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory. +When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory. ### 1. Combining KV Cache Pool with HBM Prefix Caching diff --git a/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md b/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md index 1f3acd87..9208c99b 100644 --- a/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md +++ b/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md @@ -8,7 +8,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)** Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**. 2. **Optimizing TPOT** - Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system’s **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens. + Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens. --- @@ -28,7 +28,7 @@ For step-by-step deployment and configuration, refer to the following guide: ### 1. Design Approach -Under the disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key–value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication. +Under the disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication. ### 2. 
Implementation Design @@ -38,19 +38,19 @@ Our design diagram is shown below, illustrating the pull and push schemes respec #### Mooncake Connector -1. The request is sent to the Proxy’s `_handle_completions` endpoint. +1. The request is sent to the Proxy's `_handle_completions` endpoint. 2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`. -3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy. +3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy. 4. The Proxy calls `select_decoder` to choose a D node and forwards the request. 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result. #### Mooncake Layerwise Connector -1. The request is sent to the Proxy’s `_handle_completions` endpoint. +1. The request is sent to the Proxy's `_handle_completions` endpoint. 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint. 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete. -4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`. -5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding. +4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`. +5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding. 6. The D node performs decoding and returns the result. ### 3. Interface Design @@ -63,7 +63,7 @@ Taking MooncakeConnector as an example, the system is organized into three prima ### 4. Specifications Design -This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes. +This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving equal TP setups and certain unequal TP setups across multiple P and D nodes. 
| Feature | Status | |-------------------------------|----------------| diff --git a/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md b/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md index 6b1699f3..8ec26bfc 100644 --- a/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md +++ b/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md @@ -236,7 +236,7 @@ All method arguments must specify parameter types and default values, and functi #### Expert Map -The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps. +The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified of which ranks have inconsistent maps. During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency. #### Expert Weight diff --git a/docs/source/developer_guide/Design_Documents/npugraph_ex.md b/docs/source/developer_guide/Design_Documents/npugraph_ex.md index 82b97c30..9798f78c 100644 --- a/docs/source/developer_guide/Design_Documents/npugraph_ex.md +++ b/docs/source/developer_guide/Design_Documents/npugraph_ex.md @@ -2,18 +2,18 @@ ## How Does It Work? -This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode. +This is an optimization based on FX graphs, which can be considered an acceleration solution for the aclgraph mode. You can get its code [code](https://gitcode.com/Ascend/torchair) -## Default Fx Graph Optimization +## Default FX Graph Optimization -### Fx Graph pass +### FX Graph pass - For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance. - For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement. -### Fx fusion pass +### FX fusion pass npugraph_ex now provides three default operator fusion passes, and more will be added in the future. diff --git a/docs/source/developer_guide/Design_Documents/patch.md b/docs/source/developer_guide/Design_Documents/patch.md index 76c67409..37a7edf0 100644 --- a/docs/source/developer_guide/Design_Documents/patch.md +++ b/docs/source/developer_guide/Design_Documents/patch.md @@ -73,4 +73,4 @@ Before writing a patch, following the principle above, we should patch the least ## Limitations 1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Now vLLM Ascend can only patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. 
Please override `EngineCoreProc` and `DPEngineCoreProc` entirely. -2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work. +2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for that version (e.g., v0.9.n) should work. diff --git a/docs/source/developer_guide/Design_Documents/quantization.md b/docs/source/developer_guide/Design_Documents/quantization.md index 6334dd27..b0c26379 100644 --- a/docs/source/developer_guide/Design_Documents/quantization.md +++ b/docs/source/developer_guide/Design_Documents/quantization.md @@ -24,7 +24,7 @@ The `embedding` method is generally not implemented for quantization, focusing o The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process. -We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**). +We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**). **Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example: @@ -107,7 +107,7 @@ vLLM Ascend supports multiple quantization algorithms. 
The following table provi | `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation | | `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) | | `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation | -| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD Colocation Scenario uses dynamic quantization for both P node and D node; PD Disaggregation Scenario uses dynamic quantization for P node and static for D node | +| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | We support two deployment modes: PD Colocation (dynamic quantization for both P and D) and PD Disaggregation (dynamic-quant P and static-quant D) | **Static vs Dynamic:** Static quantization uses pre-computed scaling factors with better performance, while dynamic quantization computes scaling factors on-the-fly for each token/activation tensor with higher precision. diff --git a/docs/source/developer_guide/contribution/index.md b/docs/source/developer_guide/contribution/index.md index 2d50b9a8..f8d76cbe 100644 --- a/docs/source/developer_guide/contribution/index.md +++ b/docs/source/developer_guide/contribution/index.md @@ -34,11 +34,10 @@ bash format.sh #### Run CI locally -After completing "Run lint" setup, you can run CI locally: +After completing "Run lint" setup, you can run CI (Continuous integration) locally: ```{code-block} bash :substitutions: - cd ~/vllm-project/ # Run CI needs vLLM installed @@ -75,7 +74,7 @@ You can refer to [Testing](./testing.md) to set up a testing environment and ru ## DCO and Signed-off-by -When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO. +When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO (Developer Certificate of Origin). Using `-s` with `git commit` will automatically add this header. diff --git a/docs/source/developer_guide/contribution/multi_node_test.md b/docs/source/developer_guide/contribution/multi_node_test.md index db2f81e2..4fd6e999 100644 --- a/docs/source/developer_guide/contribution/multi_node_test.md +++ b/docs/source/developer_guide/contribution/multi_node_test.md @@ -119,7 +119,7 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/ See Which can be used as a reference -- Step 2. Deploy the following yaml file `lws.yaml` as what you want +- Step 2. Deploy the following yaml file `lws.yaml` as needed ```yaml apiVersion: leaderworkerset.x-k8s.io/v1 diff --git a/docs/source/developer_guide/evaluation/using_ais_bench.md b/docs/source/developer_guide/evaluation/using_ais_bench.md index a83e9ddd..d61a88a8 100644 --- a/docs/source/developer_guide/evaluation/using_ais_bench.md +++ b/docs/source/developer_guide/evaluation/using_ais_bench.md @@ -161,7 +161,7 @@ There are several arguments that you should update according to your environment - `path`: Update to your model weight path. - `model`: Update to your model name in vLLM. 
- `host_ip` and `host_port`: Update to your vLLM server ip and port. -- `max_out_len`: Note `max_out_len` + LLM input length should be less than `max-model-len`(config in your vllm server), `32768` will be suitable for most datasets. +- `max_out_len`: Note `max_out_len` + LLM input length should be less than `max_model_len`(config in your vllm server), `32768` will be suitable for most datasets. - `batch_size`: Update according to your dataset. - `temperature`: Update inference argument. diff --git a/docs/source/developer_guide/evaluation/using_evalscope.md b/docs/source/developer_guide/evaluation/using_evalscope.md index 89f05cf7..a6185839 100644 --- a/docs/source/developer_guide/evaluation/using_evalscope.md +++ b/docs/source/developer_guide/evaluation/using_evalscope.md @@ -65,7 +65,7 @@ pip install gradio plotly evalscope ## 3. Run GSM8K using EvalScope for accuracy testing -You can use `evalscope eval` to run GSM8K for accuracy testing: +You can use `evalscope eval` to run GSM8K (a grade-school math benchmark dataset) for accuracy testing: ```shell evalscope eval \ diff --git a/docs/source/developer_guide/evaluation/using_lm_eval.md b/docs/source/developer_guide/evaluation/using_lm_eval.md index 28c56849..c2154117 100644 --- a/docs/source/developer_guide/evaluation/using_lm_eval.md +++ b/docs/source/developer_guide/evaluation/using_lm_eval.md @@ -43,7 +43,7 @@ INFO: Waiting for application startup. INFO: Application startup complete. ``` -### 2. Run GSM8K using lm-eval for accuracy testing +### 2. Run GSM8K using the vLLM server (curl) and then run lm-eval for accuracy testing You can query the result with input prompts: diff --git a/docs/source/developer_guide/evaluation/using_opencompass.md b/docs/source/developer_guide/evaluation/using_opencompass.md index c1e7d77e..9ae183bb 100644 --- a/docs/source/developer_guide/evaluation/using_opencompass.md +++ b/docs/source/developer_guide/evaluation/using_opencompass.md @@ -53,7 +53,7 @@ curl http://localhost:8000/v1/completions \ }' ``` -## 2. Run C-Eval using OpenCompass for accuracy testing +## 2. Run C-Eval (a Chinese language model evaluation benchmark) using OpenCompass for accuracy testing Install OpenCompass and configure the environment variables in the container: diff --git a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md index 7071156e..6db78c5e 100644 --- a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md +++ b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md @@ -88,7 +88,7 @@ Navigate to the `./vllm_profile` directory and locate the generated `*ascend_pt` ```python from torch_npu.profiler.profiler import analyse -analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/") +analyse("./vllm_profile/localhost.localdomain_*_ascend_pt/") ``` ### 5. View Results diff --git a/docs/source/faqs.md b/docs/source/faqs.md index 79ba3ad3..97310ec5 100644 --- a/docs/source/faqs.md +++ b/docs/source/faqs.md @@ -146,7 +146,7 @@ In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynam ### 13. Failed to enable NPU graph mode when running DeepSeek -Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. 
Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update. +Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA (Multi-Head Latent Attention) and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update. And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, `num_heads`/`num_kv_heads` is {32, 64, 128}. diff --git a/docs/source/installation.md b/docs/source/installation.md index 9502d839..1313a3fe 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -244,7 +244,7 @@ docker run --rm \ -it $IMAGE bash ``` -The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developer immediately take place changes without requiring a new installation. +The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) so that changes made by developers take effect immediately without requiring a new installation. ## Extra information diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md index fa15b399..cb7dcaec 100644 --- a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md +++ b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md @@ -302,7 +302,7 @@ bash proxy.sh The parameters are explained as follows: - `--tensor-parallel-size` 16 are common settings for tensor parallelism (TP) sizes. -- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism (PCP) sizes. +- `--prefill-context-parallel-size` 2 is a common setting for prefill context parallelism (PCP) sizes. - `--decode-context-parallel-size` 8 are common settings for decode context parallelism (DCP) sizes. - `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request. - `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency. diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md index 6fa5d359..157887f7 100644 --- a/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md +++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_multi_node.md @@ -4,7 +4,7 @@ vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide takes one-by-one steps to verify these features with constrained resources. 
-Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance. +Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance. ## Verify Multi-Node Communication Environment @@ -735,14 +735,14 @@ vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \ ### Start the service ```bash -# on 190.0.0.1 -python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100 -# on 190.0.0.2 -python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100 -# on 190.0.0.3 -python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100 -# on 190.0.0.4 -python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100 +# on 192.0.0.1 +python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100 +# on 192.0.0.2 +python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100 +# on 192.0.0.3 +python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100 +# on 192.0.0.4 +python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100 ``` ## Example Proxy for Deployment diff --git a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md index e96bf733..37a5496d 100644 --- a/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md +++ b/docs/source/tutorials/features/pd_disaggregation_mooncake_single_node.md @@ -2,9 +2,9 @@ ## Getting Started -vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources. +vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide provides step-by-step instructions to verify this feature in resource-constrained environments. -Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1. +Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture (one Prefiller and one Decoder on the same node). Assume the IP address is 192.0.0.1.
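Once the 1P1D deployment described later in this guide is up behind its proxy, a quick end-to-end connectivity check is a small client request. The sketch below is illustrative only: the proxy address, port 8000, and the served model name are assumptions, so substitute the values from your own deployment.

```python
# Minimal smoke test against the 1P1D proxy (all endpoint details are assumed).
import requests

resp = requests.post(
    "http://192.0.0.1:8000/v1/completions",      # assumed proxy address and port
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # assumed served model name
        "prompt": "Hello, my name is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```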
## Verify Communication Environment diff --git a/docs/source/tutorials/models/DeepSeek-V3.1.md b/docs/source/tutorials/models/DeepSeek-V3.1.md index ae1e1ee4..79530990 100644 --- a/docs/source/tutorials/models/DeepSeek-V3.1.md +++ b/docs/source/tutorials/models/DeepSeek-V3.1.md @@ -257,7 +257,7 @@ Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 nodes) - `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` require 4 Atlas 800 A3 (64G × 16). -To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests. +To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests. 1. `launch_online_dp.py` to launch external dp vllm servers. [launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py) diff --git a/docs/source/tutorials/models/GLM4.x.md b/docs/source/tutorials/models/GLM4.x.md index 773a48cc..ed331399 100644 --- a/docs/source/tutorials/models/GLM4.x.md +++ b/docs/source/tutorials/models/GLM4.x.md @@ -22,7 +22,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea - `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6). - `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7). - `GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8). -- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm do not support GLM4.6 mtp in October, so we do not provide mtp version. And last month, it supported, you can use the following quantization scheme to add mtp weights to Quantized weights. +- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm does not support GLM4.6 mtp in October, we do not provide an mtp version. Last month, it was supported; you can use the following quantization scheme to add mtp weights to the quantized weights. - `GLM-4.7-w8a8-with-float-mtp`(Quantized version without mtp): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8-floatmtp). - `Method of Quantify`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantify the model. @@ -38,7 +38,7 @@ You can use our official docker image to run `GLM-4.x` directly. ::::{tab-item} A3 series :sync: A3 -Start the docker image on your each node. +Start the docker image on each node. ```{code-block} bash :substitutions: diff --git a/docs/source/tutorials/models/GLM5.md b/docs/source/tutorials/models/GLM5.md index e2c2e66a..f0710d38 100644 --- a/docs/source/tutorials/models/GLM5.md +++ b/docs/source/tutorials/models/GLM5.md @@ -2,7 +2,7 @@ ## Introduction -[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks. 
+[GLM-5](https://huggingface.co/zai-org/GLM-5) uses a Mixture-of-Experts (MoE) architecture and targets complex systems engineering and long-horizon agentic tasks. The `GLM-5` model is first supported in `vllm-ascend:v0.17.0rc1`. In `vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1` , the version of transformers need to be upgraded to 5.2.0. diff --git a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md index 60d91094..0e123c57 100644 --- a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md +++ b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md @@ -2,7 +2,7 @@ ## Introduction -Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. +Qwen3-Omni is a native end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, which contains the thinker component, is equipped with chain-of-thought reasoning and supports audio, video, and text input, with text output. This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, accuracy and performance evaluation. diff --git a/docs/source/tutorials/models/Qwen3-VL-Reranker.md b/docs/source/tutorials/models/Qwen3-VL-Reranker.md index 5bd55bf6..cb5a5b9d 100644 --- a/docs/source/tutorials/models/Qwen3-VL-Reranker.md +++ b/docs/source/tutorials/models/Qwen3-VL-Reranker.md @@ -130,14 +130,14 @@ model_name = "Qwen/Qwen3-VL-Reranker-8B" # What is the difference between the official original version and one # that has been converted into a sequence classification model? -# Qwen3-Reranker is a language model that doing reranker by using the +# Qwen3-VL-Reranker is a language model that does reranking by using the # logits of "no" and "yes" tokens. -# It needs to computing 151669 tokens logits, making this method extremely -# inefficient, not to mention incompatible with the vllm score API. +# It needs to compute 151669 tokens logits, making this method extremely +# inefficient, not to mention incompatible with the vLLM score API. # A method for converting the original model into a sequence classification # model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3 # Models converted offline using this method can not only be more efficient -# and support the vllm score API, but also make the init parameters more +# and support the vLLM score API, but also make the init parameters more # concise, for example. # model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling") @@ -163,9 +163,9 @@ model = LLM( ) # Why do we need hf_overrides for the official original version: -# vllm converts it to Qwen3VLForSequenceClassification when loaded for +# vLLM converts it to Qwen3VLForSequenceClassification when loaded for # better performance. 
-# - Firstly, we need using `"architectures": ["Qwen3VLForSequenceClassification"],` +# - Firstly, we need to use `"architectures": ["Qwen3VLForSequenceClassification"],` # to manually route to Qwen3VLForSequenceClassification. # - Then, we will extract the vector corresponding to classifier_from_token # from lm_head using `"classifier_from_token": ["no", "yes"]`. diff --git a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md index 488bab81..8204bcb3 100644 --- a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md +++ b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md @@ -514,7 +514,7 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl - `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency. - `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`. And the min is `n = 1` and the max is `n = max-num-seqs`. For other values, it is recommended to set them to the number of frequently occurring requests on the Decode (D) node. - `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously. - - `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944) + - `--no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944) 4. Run the `proxy.sh` script on the prefill master node diff --git a/docs/source/tutorials/models/Qwen3_embedding.md b/docs/source/tutorials/models/Qwen3_embedding.md index a7c497d8..81d45f7e 100644 --- a/docs/source/tutorials/models/Qwen3_embedding.md +++ b/docs/source/tutorials/models/Qwen3_embedding.md @@ -95,7 +95,7 @@ Processed prompts: 100%|██████████████████ ## Performance -Run performance of `Qwen3-Reranker-8B` as an example. +Run performance of `Qwen3-Embedding-8B` as an example. Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details. Take the `serve` as an example. Run the code as follows. diff --git a/docs/source/tutorials/models/Qwen3_reranker.md b/docs/source/tutorials/models/Qwen3_reranker.md index 2736171d..ba30df56 100644 --- a/docs/source/tutorials/models/Qwen3_reranker.md +++ b/docs/source/tutorials/models/Qwen3_reranker.md @@ -95,14 +95,14 @@ model_name = "Qwen/Qwen3-Reranker-8B" # What is the difference between the official original version and one # that has been converted into a sequence classification model? -# Qwen3-Reranker is a language model that doing reranker by using the +# Qwen3-Reranker is a language model that does reranker by using the # logits of "no" and "yes" tokens. -# It needs to computing 151669 tokens logits, making this method extremely -# inefficient, not to mention incompatible with the vllm score API. 
+# It needs to compute 151669 tokens logits, making this method extremely +# inefficient, not to mention incompatible with the vLLM score API. # A method for converting the original model into a sequence classification # model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3 # Models converted offline using this method can not only be more efficient -# and support the vllm score API, but also make the init parameters more +# and support the vLLM score API, but also make the init parameters more # concise, for example. # model = LLM(model="Qwen/Qwen3-Reranker-8B", task="score") @@ -120,7 +120,7 @@ model = LLM( ) # Why do we need hf_overrides for the official original version: -# vllm converts it to Qwen3ForSequenceClassification when loaded for +# vLLM converts it to Qwen3ForSequenceClassification when loaded for # better performance. # - Firstly, we need using `"architectures": ["Qwen3ForSequenceClassification"],` # to manually route to Qwen3ForSequenceClassification. diff --git a/docs/source/user_guide/configuration/additional_config.md b/docs/source/user_guide/configuration/additional_config.md index a6b10c56..e96e7592 100644 --- a/docs/source/user_guide/configuration/additional_config.md +++ b/docs/source/user_guide/configuration/additional_config.md @@ -30,7 +30,7 @@ The following table lists additional configuration options available in vLLM Asc | `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch | | `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism | | `ascend_compilation_config` | dict | `{}` | Configuration options for ascend compilation | -| `eplb_config` | dict | `{}` | Configuration options for ascend compilation | +| `eplb_config` | dict | `{}` | Configuration options for eplb | | `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. | | `dump_config_path` | str | `None` | Configuration file path for msprobe dump(eager mode). | | `enable_async_exponential` | bool | `False` | Whether to enable asynchronous exponential overlap. To enable asynchronous exponential, set this config to True. | diff --git a/docs/source/user_guide/feature_guide/batch_invariance.md b/docs/source/user_guide/feature_guide/batch_invariance.md index 6a042dce..9d6c0efe 100644 --- a/docs/source/user_guide/feature_guide/batch_invariance.md +++ b/docs/source/user_guide/feature_guide/batch_invariance.md @@ -23,7 +23,7 @@ We will support other NPUs in the future. ## Software Requirements -Batch invariance requires a customed operator library for 910B. +Batch invariance requires a custom operator library for 910B. We will release the customed operator library in future versions. ## Enabling Batch Invariance diff --git a/docs/source/user_guide/feature_guide/cpu_binding.md b/docs/source/user_guide/feature_guide/cpu_binding.md index 73ce9d2b..10004874 100644 --- a/docs/source/user_guide/feature_guide/cpu_binding.md +++ b/docs/source/user_guide/feature_guide/cpu_binding.md @@ -67,12 +67,11 @@ sudo dnf install -y util-linux numactl procps-ng ### IRQ binding's additional considerations -For best results, if you run inside a docker container, which `systemctl` is likely unavailable, stop `irqbalance` service on the host manually before starting vLLM. 
Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding: +For best results, if you run inside a Docker container where `systemctl` is likely unavailable, stop the `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding: - **Stop `irqbalance` service**: For example, on Ubuntu system, you can run the following command to stop irqbalance: - ```bash sudo systemctl stop irqbalance ``` diff --git a/docs/source/user_guide/feature_guide/dynamic_batch.md b/docs/source/user_guide/feature_guide/dynamic_batch.md index 03681aba..8b7855b1 100644 --- a/docs/source/user_guide/feature_guide/dynamic_batch.md +++ b/docs/source/user_guide/feature_guide/dynamic_batch.md @@ -21,7 +21,7 @@ We are working on further improvements and this feature will support more XPUs i ### Tuning Parameters -`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature, larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements. +`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature. Larger values relax the latency limit, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements. ```python --SLO_limits_for_dynamic_batch =-1 # default value, dynamic batch disabled. diff --git a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md index 0029e82d..410ca0e6 100644 --- a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md +++ b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md @@ -2,7 +2,7 @@ ## Overview -Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity. +Expert balancing for MoE (Mixture of Experts) models in LLM (Large Language Model) serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity. ## EPLB Effects diff --git a/docs/source/user_guide/feature_guide/kv_pool.md b/docs/source/user_guide/feature_guide/kv_pool.md index d2f3f2a5..8edb4d73 100644 --- a/docs/source/user_guide/feature_guide/kv_pool.md +++ b/docs/source/user_guide/feature_guide/kv_pool.md @@ -97,7 +97,7 @@ export PYTHONHASHSEED=0 | :--- | :--- | :--- | :--- | | 800 I/T A3 series | HDK >= 26.0.0
CANN >= 9.0.0 | `export ASCEND_ENABLE_USE_FABRIC_MEM=1` | **Recommended**. Enables unified memory address direct transmission scheme. | | 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). | -| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission cheme on 800 I/T A2 series| +| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission scheme on 800 I/T A2 series| ### FAQ for HIXL (ascend_direct) backend diff --git a/docs/source/user_guide/feature_guide/layer_sharding.md b/docs/source/user_guide/feature_guide/layer_sharding.md index 770a50cb..5162bec1 100644 --- a/docs/source/user_guide/feature_guide/layer_sharding.md +++ b/docs/source/user_guide/feature_guide/layer_sharding.md @@ -2,9 +2,9 @@ ## Overview -**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights. +**Layer Sharding Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights. -Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**: +Instead of replicating all weights on every device, **Layer Sharding Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**: - The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group. - Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass. @@ -23,13 +23,13 @@ This approach **preserves exact computational semantics** while **significantly ![layer shard](./images/layer_sharding.png) -> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes—enabling **zero-overhead** weight loading. +> **Figure.** Layer Sharding Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes-enabling **zero-overhead** weight loading. --- ## Getting Started -To enable **Layer Shard Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use: +To enable **Layer Sharding Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. 
For example, to shard the `o_proj` and `q_b_proj` layers, use: ```bash --additional-config '{ diff --git a/docs/source/user_guide/feature_guide/npugraph_ex.md b/docs/source/user_guide/feature_guide/npugraph_ex.md index b0a7b2b2..dfe76ee4 100644 --- a/docs/source/user_guide/feature_guide/npugraph_ex.md +++ b/docs/source/user_guide/feature_guide/npugraph_ex.md @@ -4,7 +4,7 @@ As introduced in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715), this is a simple ACLGraph graph mode acceleration solution based on Fx graphs. -## Using npugraph_ex +## Using Npugraph_ex Npugraph_ex will be enabled by default in the future, Take Qwen series models as an example to show how to configure it. diff --git a/docs/source/user_guide/feature_guide/sleep_mode.md b/docs/source/user_guide/feature_guide/sleep_mode.md index ba7a0bb0..ddc7ebce 100644 --- a/docs/source/user_guide/feature_guide/sleep_mode.md +++ b/docs/source/user_guide/feature_guide/sleep_mode.md @@ -8,7 +8,7 @@ Since the generation and training phases may employ different model parallelism ## Getting started -With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vllm is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`. +With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vLLM is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`. The engine (v0/v1) supports two sleep levels to manage memory during idle periods: diff --git a/docs/source/user_guide/feature_guide/weight_prefetch.md b/docs/source/user_guide/feature_guide/weight_prefetch.md index 1bb47379..4a20476e 100644 --- a/docs/source/user_guide/feature_guide/weight_prefetch.md +++ b/docs/source/user_guide/feature_guide/weight_prefetch.md @@ -6,7 +6,7 @@ Since we use vector computations to hide the weight prefetching pipeline, this h ## Quick Start -With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to open weight prefetch. +Use `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to enable weight prefetch. ## Fine-tune Prefetch Ratio diff --git a/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md b/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md index 49d988fb..71d22675 100644 --- a/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md +++ b/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md @@ -71,10 +71,10 @@ vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \ `HCCL_EXEC_TIMEOUT`, `HCCL_CONNECT_TIMEOUT`, and `HCCL_IF_IP` are hccl-related configurations.
Set `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`, and `HCCL_SOCKET_IFNAME` to the corresponding NIC.
`ASCEND_RT_VISIBLE_DEVICES` specifies the cards on which the node run resides. The total number of cards equals `dp_size*tp_size`.
-`/xxxxx/DeepSeek-V2-Lite-Chat` is configured as a model that requires run.
+`/xxxxx/DeepSeek-V2-Lite-Chat` is the path of the model to be served.
`--host`: indicates the IP address of the node to be started.
-`--port`: indicates the port to be started, which corresponds to the port in step 4.
-`--seed`, --max-model-len, and --max-num-batched-tokens model basic configuration. Set this parameter based on the site requirements.
+`--port`: indicates the port on which the prefill node will listen (e.g., 8100). This port is later referenced in step 3 when configuring the proxy server.
+`--seed`, `--max-model-len`, and `--max-num-batched-tokens` are part of the model's basic configuration. Set these parameters based on the site requirements.
`--tensor-parallel-size`: specifies the TP size.
`--data-parallel-size`: indicates the DP size.
`--data-parallel-address`: indicates the IP address of the DP. Set this parameter to the IP address of the node.
`--data-parallel-rpc-port`: indicates the RPC port for communication in the DP group.
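+
+For reference, a minimal launch sketch assembling the parameters above could look like the following (all IP addresses, NIC names, card IDs, and port values are placeholders, and the connector-related arguments from the full startup command shown earlier in this guide are omitted):
+
+```bash
+# Placeholder values only; adjust the NIC, IP addresses, cards, and parallel sizes to your environment.
+export GLOO_SOCKET_IFNAME=eth0
+export TP_SOCKET_IFNAME=eth0
+export HCCL_SOCKET_IFNAME=eth0
+export HCCL_IF_IP=192.0.2.10
+export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3   # dp_size * tp_size = 4 cards
+
+vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
+    --host 192.0.2.10 \
+    --port 8100 \
+    --seed 1024 \
+    --max-model-len 8192 \
+    --max-num-batched-tokens 8192 \
+    --tensor-parallel-size 2 \
+    --data-parallel-size 2 \
+    --data-parallel-address 192.0.2.10 \
+    --data-parallel-rpc-port 13389
+```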
@@ -144,7 +144,7 @@ python load_balance_proxy_server_example.py --host localhost --prefiller-hosts h
`--host`: indicates the active node. The host used in the curl command in step 5 must match this value. The default port for starting the service proxy is 8000.
`--prefiller-hosts`: Set this parameter to the IP addresses of all p nodes. In the xpyd scenario, add the IP addresses to the end of this configuration item and leave a blank space between the IP addresses.
-`--prefiller-ports`: Set this parameter to the port number of all p nodes, which is the configuration of the port number for the vllm to start the service in step 3. Write the port number after the configuration in sequence and leave a blank space between the port number and the port number. The sequence must be one-to-one mapping to the IP address of --prefiller-hosts.
+`--prefiller-ports`: Set this parameter to the port numbers of all prefill (P) nodes, which were defined in step 1 when starting the prefill nodes. List the port numbers in the same order as the IP addresses in `--prefiller-hosts`, separated by spaces, so that each port maps one-to-one to its host.
`--decoder-hosts`: Set this parameter to the IP addresses of all d nodes. In the xpyd scenario, add the IP addresses to the end of this configuration item and leave a blank space between the IP addresses.
`--decoder-ports`: Set this parameter to the port numbers of all decode (D) nodes, i.e., the ports configured when starting the vLLM service in step 4. List the port numbers in the same order as the IP addresses in `--decoder-hosts`, separated by spaces, so that each port maps one-to-one to its host.
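+
+As a concrete illustration of how these host/port lists pair up, a 2P1D proxy launch might look like the sketch below (all IP addresses and ports are placeholders; the prefill ports must match the `--port` values from step 1 and the decode ports those used in step 4):
+
+```bash
+# 2 prefill (P) nodes and 1 decode (D) node; placeholder values only.
+python load_balance_proxy_server_example.py \
+    --host localhost \
+    --prefiller-hosts 192.0.2.10 192.0.2.11 \
+    --prefiller-ports 8100 8100 \
+    --decoder-hosts 192.0.2.12 \
+    --decoder-ports 8200
+```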
diff --git a/examples/epd_disaggregated/epd_disaggregated_guide.md b/examples/epd_disaggregated/epd_disaggregated_guide.md
index a38268bd..79c9157d 100644
--- a/examples/epd_disaggregated/epd_disaggregated_guide.md
+++ b/examples/epd_disaggregated/epd_disaggregated_guide.md
@@ -11,7 +11,7 @@

 ## run

-The EPD disaggregated technology accelerates model inference by decoupling the visual encoding computation and LLM computation stages. Currently, the EPD separation feature can achieve different data transmissions between E and P/PD nodes by configuring different connector backends. Vllm-ascend currently supports the ECexample-connector backend implemented on vllm, and will support Mooncake as well as shared memory(SHM) backend transmission methods in the future.
+The EPD disaggregated technology accelerates model inference by decoupling the visual encoding and LLM computation stages. Currently, the EPD separation feature supports different data transmission methods between E and P/PD nodes by configuring different connector backends. vLLM Ascend currently supports the ECExampleConnector backend implemented on vLLM, and will support Mooncake as well as shared memory (SHM) backend transmission methods in the future.

 ### ECexample-connector deployment guide

@@ -114,7 +114,7 @@ python3 epd_load_balance_proxy_layerwise_server_example.py \
     --port 8001
 ```

-TODO: explain the param.
+The parameters are explained as follows:
`--encoder-hosts`: E node IP address.
`--encoder-ports`: The E node port number. It needs to be consistent with the --port in the E node's startup script.
`--pd-hosts`: PD node IP address.
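+
+For illustration, combining the parameters above with the proxy port from the startup command gives a sketch roughly like the one below (IP addresses and ports are placeholders; any PD-side port arguments not listed here follow the same host/port pairing pattern):
+
+```bash
+# Placeholder values; the encoder port must match --port in the E node's startup script.
+python3 epd_load_balance_proxy_layerwise_server_example.py \
+    --encoder-hosts 192.0.2.20 \
+    --encoder-ports 8400 \
+    --pd-hosts 192.0.2.21 \
+    --port 8001
+```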
diff --git a/examples/external_online_dp/README.md b/examples/external_online_dp/README.md index a02ab75c..99f0ac52 100644 --- a/examples/external_online_dp/README.md +++ b/examples/external_online_dp/README.md @@ -1,8 +1,8 @@ -Here is an example guiding how to use `launch_online_dp.py` to launch external dp vllm servers. User can easily launch external dp servers following the steps below: +Here is an example guiding how to use `launch_online_dp.py` to launch external dp vLLM servers. User can easily launch external dp servers following the steps below: ### Modify parameters in `run_dp_template.sh` -`run_dp_template.sh` is an template script used to launch each dp vllm instance separately. It will be called by `launch_online_dp.py` in multi threads and most of its configurations are set by `launch_online_dp.py`. Parameters you need to set manually include: +`run_dp_template.sh` is a template script used to launch each data parallel (dp) vLLM instance separately. It will be called by `launch_online_dp.py` in multiple threads and most of its configurations are set by `launch_online_dp.py`. Parameters you need to set manually include: 1. The IP and socket_ifname of your machine. If running on multi-nodes, please make sure the scripts on each node has been set with correct IP and socket_ifname of that node. 2. vLLM serving related parameters including model_path and other configurations. Note that port, dp-related parameters and tp_size is set by `launch_online_dp.py`, all the other vLLM parameters in this file only serve as an example and you are free to modify them according to your purpose. @@ -42,7 +42,7 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s ### (Optional) Run `dp_load_balance_proxy_server.py` to load balance requests between external dp servers -External dp server means that you need to handle load balance between multiple dp instances out of vllm by implementing your custom proxy server. Here we provide an example of request-length-aware dp load-balance proxy server for you. The arguments of `dp_load_balance_proxy_server.py` include: +External dp server means that you need to handle load balance between multiple dp instances out of vLLM by implementing your custom proxy server. Here we provide an example of request-length-aware dp load-balance proxy server for you. The arguments of `dp_load_balance_proxy_server.py` include: 1. `--port`: port of proxy server, default 8000 2. `--host`: host address of proxy server, default localhost diff --git a/tests/e2e/models/report_template.md b/tests/e2e/models/report_template.md index 088744d1..630c4f18 100644 --- a/tests/e2e/models/report_template.md +++ b/tests/e2e/models/report_template.md @@ -30,5 +30,5 @@ lm_eval --model {{ model_type }} --model_args $MODEL_ARGS \ | Task | Metric | Value | Stderr | |-----------------------|-------------|----------:|-------:| {% for row in rows -%} -| {{ row.task }} | {{ row.metric }} | {{ row.value }} | ± {{ "%.4f" | format(row.stderr | float) }} | +| {{ row.task }} | {{ row.metric }} | {{ row.value }} | ± {{ "%.4f" \| format(row.stderr \| float) }} | {% endfor %}