[Doc][Misc] Improve readability and fix typos in documentation (#8340)
### What this PR does / why we need it?

This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
@@ -99,7 +99,7 @@ pytest -sv tests/ut/ops/test_prepare_finalize.py
 pytest -sv tests/ut/ops/test_prepare_finalize.py::test_prepare_inputs
 
 # Run NPU-specific tests (requires NPU hardware)
-pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency
+pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency.py
 ```
 
 **Requirement**: Run all tests locally before requesting review. Verify tests pass on NPU hardware for NPU-specific changes.
@@ -163,7 +163,7 @@ pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency
 
 **Warning**: `tensor.item()` operations cause synchronization overhead on NPU when the `tensor` is on device.
 
-If the `tensor` is a device tensor, the operator `item()` will trigger a synchronous data transfer from NPU to CPU, which can severely degrade performance in hot paths, cause this will make `AsyncScheduler` block here.
+If the `tensor` is a device tensor, calling `item()` triggers a synchronous data transfer from NPU to CPU. This can severely degrade performance in hot paths, causing `AsyncScheduler` to block here.
 
 **Review Requirements:**
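The warning above can be illustrated schematically. No real torch/torch_npu is used here: `FakeDeviceTensor` is a made-up stub whose only job is to count device-to-CPU synchronizations, so the per-element `.item()` anti-pattern can be compared against a single bulk `.tolist()`-style transfer.

```python
# Schematic only: per-element `.item()` forces one device->CPU sync per call,
# while a bulk transfer syncs once. The class below is a hypothetical stub,
# not a torch/torch_npu API.

class FakeDeviceTensor:
    def __init__(self, values):
        self.values = list(values)
        self.sync_count = 0  # number of simulated device->CPU synchronizations

    def item_at(self, i):
        # Analogous to tensor[i].item(): one sync per call.
        self.sync_count += 1
        return self.values[i]

    def tolist(self):
        # Analogous to tensor.tolist(): one sync for the whole tensor.
        self.sync_count += 1
        return list(self.values)

t = FakeDeviceTensor(range(1000))

# Anti-pattern in a hot path: 1000 separate synchronizations.
slow = [t.item_at(i) for i in range(len(t.values))]
per_item_syncs = t.sync_count

t.sync_count = 0
# Preferred: transfer once, then work on host data.
fast = t.tolist()
bulk_syncs = t.sync_count
```

The same data is recovered either way; only the number of blocking round-trips differs, which is exactly what stalls `AsyncScheduler`.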
@@ -1 +1 @@
-IMPORTANT: Ensure you’ve thoroughly reviewed the [AGENTS.md](AGENTS.md) file before beginning any work.
+IMPORTANT: Ensure you've thoroughly reviewed the [AGENTS.md](AGENTS.md) file before beginning any work.
@@ -30,7 +30,7 @@ vLLM Ascend Plugin
 - [2025/12] We released the new official version [v0.11.0](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.11.0)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.11.0/) to start using vLLM Ascend Plugin on Ascend.
 - [2025/09] We released the new official version [v0.9.1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.1)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) to start deploying large-scale Expert Parallelism (EP) on Ascend.
 - [2025/08] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q) with vLLM and Tencent! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
 - [2025/06] [User stories](https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
 - [2025/06] [Contributors](https://docs.vllm.ai/projects/ascend/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded, thanks for all contributors.
 - [2025/05] We've released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
 - [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
@@ -62,7 +62,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained plugin that enables vLLM to run on Ascend NP
 
 ## Contributing
 
-Please refer to the [CONTRIBUTING]((https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html)) document for more information about development environment setup, feature testing, and PR submission guidelines.
+Please refer to the [CONTRIBUTING](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html) document for more information about development environment setup, feature testing, and PR submission guidelines.
 
 We welcome and value any form of contribution and collaboration:
@@ -74,7 +74,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained plugin that enables vLLM to run on Ascend NP
 vllm-ascend has a main branch and development branches.
 
 - **main**: the main branch, corresponding to vLLM's main branch, with quality continuously guarded by Ascend CI.
-- **releases/vX.Y.Z**: development branches, created as certain new vLLM versions are released; for example, `releases/v0.13.0` is the vllm-asend development branch for vLLM `v0.13.0`.
+- **releases/vX.Y.Z**: development branches, created as certain new vLLM versions are released; for example, `releases/v0.13.0` is the vllm-ascend development branch for vLLM `v0.13.0`.
 
 The branches currently under maintenance are:
@@ -46,7 +46,7 @@ Before running the benchmarks, ensure the following:
 ```
 
 - For the performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`. It will construct random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
-- If you want to run benchmark customized, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests), let's take `Qwen2.5-VL-7B-Instruct`as an example:
+- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
 
 ```json
 [
@@ -171,7 +171,7 @@ Notes:
 
 ## Software dependency management
 
-- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPI](https://pypi.org/project/torch-npu)
+- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
 every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
-The PyPI stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
-vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version or branch.
+The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
+vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch.
@@ -26,7 +26,7 @@ The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) throug
 
 Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
 
-When combined with vLLM’s Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
+When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
 
 ### 1. Combining KV Cache Pool with HBM Prefix Caching
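The hot/spill behavior described above can be sketched with a tiny two-tier cache: a fixed-capacity hot tier evicts least-recently-used KV blocks into a larger cold tier, and blocks are promoted back on reuse. The tier names, class, and API below are illustrative stand-ins, not the Mooncake interface.

```python
# Minimal sketch of tiered KV caching: hot tier ("HBM") of fixed capacity,
# LRU spill-over into a cold tier ("DRAM"). Hypothetical API for illustration.
from collections import OrderedDict

class TieredKVPool:
    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # block_id -> kv data, in LRU order
        self.cold = {}             # spill-over tier
        self.hot_capacity = hot_capacity

    def put(self, block_id, kv):
        self.hot[block_id] = kv
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            victim, data = self.hot.popitem(last=False)  # evict LRU block
            self.cold[victim] = data                     # spill to cold tier

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)               # keep hot prefixes hot
            return self.hot[block_id]
        if block_id in self.cold:
            self.put(block_id, self.cold.pop(block_id))  # promote on reuse
            return self.hot[block_id]
        return None

pool = TieredKVPool(hot_capacity=2)
pool.put("p1", "kv1")
pool.put("p2", "kv2")
pool.put("p3", "kv3")
spilled = "p1" in pool.cold  # "p1" was least recently used, so it spilled
```

Frequently reused prefixes stay in the hot tier, while cold blocks survive in cheaper memory instead of being dropped, which is the point of combining the pool with Prefix Caching.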
@@ -8,7 +8,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
 Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
 
 2. **Optimizing TPOT**
-Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system’s **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
+Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
 
 ---
@@ -28,7 +28,7 @@ For step-by-step deployment and configuration, refer to the following guide:
 
 ### 1. Design Approach
 
-Under the disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key–value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
+Under the disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
 
 ### 2. Implementation Design
@@ -38,19 +38,19 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 
 #### Mooncake Connector
 
-1. The request is sent to the Proxy’s `_handle_completions` endpoint.
+1. The request is sent to the Proxy's `_handle_completions` endpoint.
 2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
-3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
+3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
 4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
 
 #### Mooncake Layerwise Connector
 
-1. The request is sent to the Proxy’s `_handle_completions` endpoint.
+1. The request is sent to the Proxy's `_handle_completions` endpoint.
 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
-4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
-5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding.
+4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
+5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
 6. The D node performs decoding and returns the result.
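In both flows above, the Proxy's job in steps 2 and 4 is just to pick a node and forward the request. The real `select_prefiller`/`select_decoder` live in the vllm-ascend proxy and may use smarter policies; the round-robin sketch below is purely a hypothetical illustration of the role they play.

```python
# Hypothetical round-robin node selection, illustrating the role of
# select_prefiller/select_decoder in the Mooncake connector flows above.
from itertools import cycle

class ToyProxy:
    def __init__(self, prefillers, decoders):
        self._prefillers = cycle(prefillers)
        self._decoders = cycle(decoders)

    def select_prefiller(self):
        # The chosen P node gets do_remote_decode=True, max_completion_tokens=1
        return next(self._prefillers)

    def select_decoder(self):
        # The chosen D node waits in RequestStatus.WAITING_FOR_REMOTE_KVS
        return next(self._decoders)

proxy = ToyProxy(["p0", "p1"], ["d0"])
```

Round-robin spreads prefill load evenly across P nodes; a production proxy could instead pick by queue depth or KV-cache locality.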
@@ -63,7 +63,7 @@ Taking MooncakeConnector as an example, the system is organized into three prima
 
 ### 4. Specifications Design
 
-This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes.
+This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving equal TP setups and certain unequal TP setups across multiple P and D nodes.
 
 | Feature | Status |
 |-------------------------------|----------------|
@@ -236,7 +236,7 @@ All method arguments must specify parameter types and default values, and functi
 
 #### Expert Map
 
-The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
+The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified of which ranks have inconsistent maps.
 During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
 
 #### Expert Weight
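The consistency check described above can be sketched as follows. In a real multi-node setup the maps would be gathered with a collective (e.g. an all-gather over ranks); here they are passed in directly, and the helper name is illustrative.

```python
# Sketch of the expert-map consistency check: compare every rank's map
# against rank 0 and report exactly which ranks disagree.

def find_inconsistent_ranks(expert_maps):
    """expert_maps: list indexed by rank; each entry maps layer -> expert list."""
    reference = expert_maps[0]
    return [rank for rank, m in enumerate(expert_maps) if m != reference]

maps = [
    {"layer0": [0, 1], "layer1": [2, 3]},  # rank 0 (reference)
    {"layer0": [0, 1], "layer1": [2, 3]},  # rank 1: consistent
    {"layer0": [1, 0], "layer1": [2, 3]},  # rank 2: inconsistent ordering
]
bad_ranks = find_inconsistent_ranks(maps)
```

Reporting the exact rank list (rather than just "inconsistent") is what lets the user locate the faulty node, as the requirement above asks.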
@@ -2,18 +2,18 @@
 
 ## How Does It Work?
 
-This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode.
+This is an optimization based on FX graphs, which can be considered an acceleration solution for the aclgraph mode.
 
 You can find its code [here](https://gitcode.com/Ascend/torchair).
 
-## Default Fx Graph Optimization
+## Default FX Graph Optimization
 
-### Fx Graph pass
+### FX Graph pass
 
 - For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
 - For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement.
 
-### Fx fusion pass
+### FX fusion pass
 
 npugraph_ex now provides three default operator fusion passes, and more will be added in the future.
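The first pass above can be modeled in miniature. Real npugraph_ex rewrites `torch.fx` graphs; here a "graph" is just a list of node dicts, and the trailing-underscore naming for in-place variants mirrors PyTorch's convention. The table and helper are illustrative assumptions, not the actual pass.

```python
# Toy model of the in-place rewrite pass: intermediate nodes get their ops
# swapped for in-place variants; graph inputs keep their original semantics.

INPLACE_VARIANTS = {"add": "add_", "mul": "mul_", "relu": "relu_"}

def inplace_rewrite(nodes):
    rewritten = []
    for node in nodes:
        op = node["op"]
        if node["kind"] == "intermediate" and op in INPLACE_VARIANTS:
            # Copy the node so the original graph is left untouched.
            node = {**node, "op": INPLACE_VARIANTS[op]}
        rewritten.append(node)
    return rewritten

graph = [
    {"kind": "input", "op": "add"},          # input: not rewritten
    {"kind": "intermediate", "op": "add"},   # rewritten to add_
    {"kind": "intermediate", "op": "softmax"},  # no in-place variant: kept
]
new_graph = inplace_rewrite(graph)
```

On a real graph, each such rewrite removes one output allocation and copy, which is where the memory-movement savings come from.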
@@ -73,4 +73,4 @@ Before writing a patch, following the principle above, we should patch the least
 ## Limitations
 
 1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Now vLLM Ascend can only patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
-2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work.
+2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for that version (e.g., v0.9.n) should work.
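The workaround in item 2 can be sketched as version-keyed patch dispatch with an environment override. `PATCHES` and `resolve_patches` are hypothetical stand-ins for vllm-ascend's actual patch machinery; only the `VLLM_VERSION` override behavior follows the text.

```python
# Sketch of VLLM_VERSION-based patch selection: the env var overrides the
# (possibly mangled) version reported by an edited vLLM install.
import os

PATCHES = {  # hypothetical patch sets keyed by vLLM version
    "0.9.1": ["patch_engine_core", "patch_worker"],
    "0.10.0": ["patch_worker"],
}

def resolve_patches(detected_version):
    version = os.environ.get("VLLM_VERSION", detected_version)
    return PATCHES.get(version, [])

# An edited install may report "0.9.1xxx", which matches no patch set...
without_override = resolve_patches("0.9.1xxx")
# ...until VLLM_VERSION pins the real base version.
os.environ["VLLM_VERSION"] = "0.9.1"
with_override = resolve_patches("0.9.1xxx")
```

The override wins unconditionally, so it should only be set when the reported version is known to be wrong.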
@@ -24,7 +24,7 @@ The `embedding` method is generally not implemented for quantization, focusing o
 
 The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
 
-We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).
+We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**).
 
 **Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
@@ -107,7 +107,7 @@ vLLM Ascend supports multiple quantization algorithms. The following table provi
 | `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
 | `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) |
 | `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
-| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD Colocation Scenario uses dynamic quantization for both P node and D node; PD Disaggregation Scenario uses dynamic quantization for P node and static for D node |
+| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | We support two deployment modes: PD Colocation (dynamic quantization for both P and D) and PD Disaggregation (dynamic-quant P and static-quant D) |
 
 **Static vs Dynamic:** Static quantization uses pre-computed scaling factors with better performance, while dynamic quantization computes scaling factors on-the-fly for each token/activation tensor with higher precision.
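The Static vs Dynamic distinction above can be made concrete for INT8: static quantization reuses one pre-computed scale, while per-token dynamic quantization derives a fresh scale from each activation row. This is a pure-Python sketch of the arithmetic only; real kernels do this on-device.

```python
# Per-token dynamic INT8 quantization sketch: each token row gets its own
# scaling factor derived from that row's max magnitude.

def dynamic_per_token_quant(activations):
    """activations: list of token rows (lists of floats).
    Returns (quantized int rows, per-token scales)."""
    q_rows, scales = [], []
    for row in activations:
        scale = max(abs(v) for v in row) / 127.0  # computed on the fly
        scales.append(scale)
        q_rows.append([round(v / scale) for v in row])
    return q_rows, scales

rows, scales = dynamic_per_token_quant([[0.5, -1.0], [2.0, 0.25]])
# Each row's largest-magnitude value maps to +/-127 under its own scale.
```

A static scheme would hard-code a single `scale` calibrated offline: cheaper at runtime (no per-row max reduction), but a poor fit for rows whose range differs from the calibration data, which is exactly the precision/performance trade-off the table describes.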
@@ -34,11 +34,10 @@ bash format.sh
 
 #### Run CI locally
 
-After completing "Run lint" setup, you can run CI locally:
+After completing the "Run lint" setup, you can run CI (Continuous Integration) locally:
 
 ```{code-block} bash
 :substitutions:
 
 cd ~/vllm-project/
 
 # Run CI needs vLLM installed
@@ -75,7 +74,7 @@ You can refer to [Testing](./testing.md) to set up a testing environment and ru
 
 ## DCO and Signed-off-by
 
-When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.
+When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO (Developer Certificate of Origin).
 
 Using `-s` with `git commit` will automatically add this header.
@@ -119,7 +119,7 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
 
 See <https://lws.sigs.k8s.io/docs/installation/>, which can be used as a reference.
 
-- Step 2. Deploy the following yaml file `lws.yaml` as what you want
+- Step 2. Deploy the following yaml file `lws.yaml` as needed
 
 ```yaml
 apiVersion: leaderworkerset.x-k8s.io/v1
@@ -161,7 +161,7 @@ There are several arguments that you should update according to your environment
 - `path`: Update to your model weight path.
 - `model`: Update to your model name in vLLM.
 - `host_ip` and `host_port`: Update to your vLLM server IP and port.
-- `max_out_len`: Note `max_out_len` + LLM input length should be less than `max-model-len`(config in your vllm server), `32768` will be suitable for most datasets.
+- `max_out_len`: Note that `max_out_len` + LLM input length should be less than `max_model_len` (configured in your vLLM server); `32768` will be suitable for most datasets.
 - `batch_size`: Update according to your dataset.
 - `temperature`: Update inference argument.
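The `max_out_len` constraint above amounts to a one-line pre-flight check. The helper below is illustrative; the names just mirror the arguments being described.

```python
# Pre-flight check for the constraint: max_out_len + input length must be
# less than the server's max_model_len, or the request cannot fit.

def fits_context(input_len, max_out_len, max_model_len):
    return input_len + max_out_len < max_model_len

# With a 65536 max_model_len and max_out_len=32768, a 30000-token prompt fits:
ok = fits_context(30000, 32768, 65536)
# ...but a 40000-token prompt does not:
too_long = fits_context(40000, 32768, 65536)
```

Running this check per dataset before launching an evaluation avoids failures partway through a long run.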
@@ -65,7 +65,7 @@ pip install gradio plotly evalscope
 
 ## 3. Run GSM8K using EvalScope for accuracy testing
 
-You can use `evalscope eval` to run GSM8K for accuracy testing:
+You can use `evalscope eval` to run GSM8K (a grade-school math benchmark dataset) for accuracy testing:
 
 ```shell
 evalscope eval \
@@ -43,7 +43,7 @@ INFO: Waiting for application startup.
 INFO: Application startup complete.
 ```
 
-### 2. Run GSM8K using lm-eval for accuracy testing
+### 2. Run GSM8K using the vLLM server (curl) and then run lm-eval for accuracy testing
 
 You can query the result with input prompts:
@@ -53,7 +53,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```
 
-## 2. Run C-Eval using OpenCompass for accuracy testing
+## 2. Run C-Eval (a Chinese language model evaluation benchmark) using OpenCompass for accuracy testing
 
 Install OpenCompass and configure the environment variables in the container:
@@ -88,7 +88,7 @@ Navigate to the `./vllm_profile` directory and locate the generated `*ascend_pt`
 ```python
 from torch_npu.profiler.profiler import analyse
-analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/")
+analyse("./vllm_profile/localhost.localdomain_*_ascend_pt/")
 ```
 
 ### 5. View Results
@@ -146,7 +146,7 @@ In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynam
 
 ### 13. Failed to enable NPU graph mode when running DeepSeek
 
-Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
+Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA (Multi-Head Latent Attention) and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV head, a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
 
 And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that after the tensor parallel split, `num_heads`/`num_kv_heads` is in {32, 64, 128}.
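The constraint above is easy to check before launching. The helper is illustrative; per the text, under MLA the DeepSeek-V2-Lite case reduces to 16 queries per KV head, which is outside the supported set.

```python
# Check the MLA + NPU graph mode constraint: queries per KV head
# (num_heads // num_kv_heads after the tensor-parallel split) must be
# one of {32, 64, 128}.

SUPPORTED_QUERIES_PER_KV = {32, 64, 128}

def graph_mode_supported(num_heads, num_kv_heads):
    return num_heads // num_kv_heads in SUPPORTED_QUERIES_PER_KV

# DeepSeek-V2-Lite: 16 queries per KV head -> unsupported.
lite_ok = graph_mode_supported(16, 1)
# A post-TP-split configuration yielding 128 queries per KV head -> supported.
v3_ok = graph_mode_supported(128, 1)
```

When the check fails, adjusting the tensor-parallel size changes the per-rank head counts and can move the ratio back into the supported set.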
@@ -244,7 +244,7 @@ docker run --rm \
 -it $IMAGE bash
 ```
 
-The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developers immediately make changes without requiring a new installation.
 
 ## Extra information
@@ -302,7 +302,7 @@ bash proxy.sh
 The parameters are explained as follows:
 
 - `--tensor-parallel-size`: 16 is a common setting for the tensor parallelism (TP) size.
-- `--prefill-context-parallel-size` 2 are common settings for prefill context parallelism (PCP) sizes.
+- `--prefill-context-parallel-size`: 2 is a common setting for the prefill context parallelism (PCP) size.
 - `--decode-context-parallel-size`: 8 is a common setting for the decode context parallelism (DCP) size.
 - `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
 - `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
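The `--max-num-seqs` rule of thumb above is simple arithmetic, sketched below with an illustrative helper: any concurrency beyond `max_num_seqs * dp_size` sits in the waiting queue and inflates measured TTFT/TPOT.

```python
# How many requests end up waiting (and thus inflate TTFT/TPOT) for a given
# total concurrency, per the max-num-seqs * data-parallel-size rule above.

def waiting_requests(total_concurrency, max_num_seqs, dp_size):
    capacity = max_num_seqs * dp_size
    return max(total_concurrency - capacity, 0)

# 64 * 4 = 256 schedulable slots: 256 concurrent requests all get scheduled,
no_wait = waiting_requests(256, 64, 4)
# but 300 concurrent requests leave 44 in the waiting queue.
overflow = waiting_requests(300, 64, 4)
```

For benchmarking, size `max_num_seqs` so that `waiting_requests(...)` is zero at the concurrency you intend to test.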
@@ -4,7 +4,7 @@
 
 vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide takes one-by-one steps to verify these features with constrained resources.
 
-Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs 16 chips to deploy one service instance.
+Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs and 16 chips to deploy one service instance.
 
 ## Verify Multi-Node Communication Environment
@@ -735,14 +735,14 @@ vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
 ### Start the service
 
 ```bash
-# on 190.0.0.1
-python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
-# on 190.0.0.2
-python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
-# on 190.0.0.3
-python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
-# on 190.0.0.4
-python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
+# on 192.0.0.1
+python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
+# on 192.0.0.2
+python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 192.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
+# on 192.0.0.3
+python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
+# on 192.0.0.4
+python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 192.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
 ```
 
 ## Example Proxy for Deployment
@@ -2,9 +2,9 @@
 
 ## Getting Started
 
-vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources.
+vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide provides step-by-step instructions to verify these features in resource-constrained environments.
 
-Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.
+Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture (one Prefiller and one Decoder on the same node). Assume the IP address is 192.0.0.1.
 
 ## Verify Communication Environment
@@ -257,7 +257,7 @@ Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 nodes)
 
 - `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16).
 
-To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
+To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on the prefill master node to forward requests.
 
 1. `launch_online_dp.py` to launch external dp vllm servers.
 [launch\_online\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)
@@ -22,7 +22,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `GLM-4.6`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6).
 - `GLM-4.7`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7).
 - `GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8).
-- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm do not support GLM4.6 mtp in October, so we do not provide mtp version. And last month, it supported, you can use the following quantization scheme to add mtp weights to Quantized weights.
+- `GLM-4.6-w8a8`(Quantized version without mtp): [Download model weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because vllm does not support GLM4.6 mtp in October, we do not provide an mtp version. Last month, it was supported; you can use the following quantization scheme to add mtp weights to the quantized weights.
 - `GLM-4.7-w8a8-with-float-mtp`(Quantized version without mtp): [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8-floatmtp).
 - `Quantization method`: [quantization scheme](https://blog.csdn.net/qq_37368095/article/details/156429653?spm=1011.2124.3001.6209). You can use these methods to quantize the model.
@@ -38,7 +38,7 @@ You can use our official docker image to run `GLM-4.x` directly.
::::{tab-item} A3 series
:sync: A3

Start the docker image on your each node.
Start the docker image on each node.

```{code-block} bash
:substitutions:
@@ -2,7 +2,7 @@
## Introduction

[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts (MoE) architecture and targeting at complex systems engineering and long-horizon agentic tasks.
[GLM-5](https://huggingface.co/zai-org/GLM-5) uses a Mixture-of-Experts (MoE) architecture and targets complex systems engineering and long-horizon agentic tasks.

The `GLM-5` model is first supported in `vllm-ascend:v0.17.0rc1`. In `vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1`, the version of transformers needs to be upgraded to 5.2.0.
@@ -2,7 +2,7 @@
## Introduction

Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output.
Qwen3-Omni is a native end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B, which contains the thinker component, is equipped with chain-of-thought reasoning and supports audio, video, and text input, with text output.

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.
@@ -130,14 +130,14 @@ model_name = "Qwen/Qwen3-VL-Reranker-8B"
# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that doing reranker by using the
# Qwen3-VL-Reranker is a language model that does reranking by using the
# logits of "no" and "yes" tokens.
# It needs to computing 151669 tokens logits, making this method extremely
# inefficient, not to mention incompatible with the vllm score API.
# It needs to compute 151669 tokens logits, making this method extremely
# inefficient, not to mention incompatible with the vLLM score API.
# A method for converting the original model into a sequence classification
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method can not only be more efficient
# and support the vllm score API, but also make the init parameters more
# and support the vLLM score API, but also make the init parameters more
# concise, for example.
# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
@@ -163,9 +163,9 @@ model = LLM(
)

# Why do we need hf_overrides for the official original version:
# vllm converts it to Qwen3VLForSequenceClassification when loaded for
# vLLM converts it to Qwen3VLForSequenceClassification when loaded for
# better performance.
# - Firstly, we need using `"architectures": ["Qwen3VLForSequenceClassification"],`
# - Firstly, we need to use `"architectures": ["Qwen3VLForSequenceClassification"],`
# to manually route to Qwen3VLForSequenceClassification.
# - Then, we will extract the vector corresponding to classifier_from_token
# from lm_head using `"classifier_from_token": ["no", "yes"]`.
@@ -514,7 +514,7 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl
- `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency.
- `cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`, where the minimum is `n = 1` and the maximum is `n = max-num-seqs`. For other values, it is recommended to set them to the number of frequently occurring requests on the Decode (D) node.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
- `no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for Prefill-Decode disaggregation feature, known issue on D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
- `--no-enable-prefix-caching`: The prefix-cache feature is enabled by default. You can use the `--no-enable-prefix-caching` parameter to disable this feature. Notice: for the Prefill-Decode disaggregation feature, there is a known issue on the D node: [#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)
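The `n x (mtp + 1)` recommendation for `cudagraph_capture_sizes` can be sketched as a small helper (illustrative only, not part of vllm-ascend; the function name is hypothetical):

```python
# Illustrative helper: enumerate the recommended cudagraph capture sizes
# n * (mtp + 1) for n = 1 .. max_num_seqs, per the guidance above.
def recommended_capture_sizes(mtp: int, max_num_seqs: int) -> list[int]:
    return [n * (mtp + 1) for n in range(1, max_num_seqs + 1)]

# With one MTP speculative token and up to 4 sequences per batch:
print(recommended_capture_sizes(mtp=1, max_num_seqs=4))  # [2, 4, 6, 8]
```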
4. Run the `proxy.sh` script on the prefill master node
@@ -95,7 +95,7 @@ Processed prompts: 100%|██████████████████
## Performance

Run performance of `Qwen3-Reranker-8B` as an example.
Run performance of `Qwen3-Embedding-8B` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.

Take `serve` as an example. Run the code as follows.
@@ -95,14 +95,14 @@ model_name = "Qwen/Qwen3-Reranker-8B"
# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that doing reranker by using the
# Qwen3-Reranker is a language model that does reranking by using the
# logits of "no" and "yes" tokens.
# It needs to computing 151669 tokens logits, making this method extremely
# inefficient, not to mention incompatible with the vllm score API.
# It needs to compute 151669 tokens logits, making this method extremely
# inefficient, not to mention incompatible with the vLLM score API.
# A method for converting the original model into a sequence classification
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method can not only be more efficient
# and support the vllm score API, but also make the init parameters more
# and support the vLLM score API, but also make the init parameters more
# concise, for example.
# model = LLM(model="Qwen/Qwen3-Reranker-8B", task="score")
@@ -120,7 +120,7 @@ model = LLM(
)

# Why do we need hf_overrides for the official original version:
# vllm converts it to Qwen3ForSequenceClassification when loaded for
# vLLM converts it to Qwen3ForSequenceClassification when loaded for
# better performance.
# - Firstly, we need to use `"architectures": ["Qwen3ForSequenceClassification"],`
# to manually route to Qwen3ForSequenceClassification.
@@ -30,7 +30,7 @@ The following table lists additional configuration options available in vLLM Asc
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism |
| `ascend_compilation_config` | dict | `{}` | Configuration options for ascend compilation |
| `eplb_config` | dict | `{}` | Configuration options for ascend compilation |
| `eplb_config` | dict | `{}` | Configuration options for EPLB |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test cases. |
| `dump_config_path` | str | `None` | Configuration file path for msprobe dump (eager mode). |
| `enable_async_exponential` | bool | `False` | Whether to enable asynchronous exponential overlap. Set this config to `True` to enable it. |
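The dict-typed options in the table are passed together as one JSON string. A minimal sketch of composing the flag from Python (only `weight_prefetch_config.enabled` is confirmed elsewhere in these docs; the sub-keys of the other dict-typed options are feature-specific, so treat the rest as placeholders and consult each feature guide):

```python
# Sketch: composing the --additional-config value as a JSON string.
# Option names come from the table above.
import json

additional_config = {
    "weight_prefetch_config": {"enabled": True},  # shown in the weight prefetch guide
    "refresh": False,
    "enable_async_exponential": False,
}

flag = "--additional-config '" + json.dumps(additional_config) + "'"
print(flag)
```

The single quotes keep the JSON intact when the flag is appended to a `vllm serve` command line.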
@@ -23,7 +23,7 @@ We will support other NPUs in the future.
## Software Requirements

Batch invariance requires a customed operator library for 910B.
Batch invariance requires a custom operator library for 910B.
We will release the custom operator library in future versions.

## Enabling Batch Invariance
@@ -67,12 +67,11 @@ sudo dnf install -y util-linux numactl procps-ng
### IRQ binding's additional considerations

For best results, if you run inside a docker container, which `systemctl` is likely unavailable, stop `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:
For best results, if you run inside a Docker container where `systemctl` is likely unavailable, stop the `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:

- **Stop `irqbalance` service**:

For example, on an Ubuntu system, you can run the following command to stop irqbalance:

```bash
sudo systemctl stop irqbalance
```
@@ -21,7 +21,7 @@ We are working on further improvements and this feature will support more XPUs i
### Tuning Parameters

`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature, larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values relax the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific model or service requirements.

```python
--SLO_limits_for_dynamic_batch=-1  # default value, dynamic batch disabled.
@@ -2,7 +2,7 @@
## Overview

Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
Expert balancing for MoE (Mixture of Experts) models in LLM (Large Language Model) serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

## EPLB Effects
@@ -97,7 +97,7 @@ export PYTHONHASHSEED=0
| :--- | :--- | :--- | :--- |
| 800 I/T A3 series | HDK >= 26.0.0<br>CANN >= 9.0.0 | `export ASCEND_ENABLE_USE_FABRIC_MEM=1` | **Recommended**. Enables unified memory address direct transmission scheme. |
| 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission cheme on 800 I/T A2 series|
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on 800 I/T A2 series |

### FAQ for HIXL (ascend_direct) backend
@@ -2,9 +2,9 @@
## Overview

**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
**Layer Sharding Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.

Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
Instead of replicating all weights on every device, **Layer Sharding Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:

- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass.
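The placement rule above can be sketched in a few lines of plain Python (an illustrative sketch with hypothetical function names, not the actual vllm-ascend implementation):

```python
# Illustrative sketch: each layer's real weight lives on exactly one device
# in a K-device group, and that device is the broadcast source when the
# weight is pre-fetched for the forward pass.

def shard_owner(layer_idx: int, group_size: int) -> int:
    """The i-th layer's real weight is stored only on device i % K."""
    return layer_idx % group_size

def broadcast_schedule(num_layers: int, group_size: int) -> list[tuple[int, int]]:
    """(layer, source_device) pairs: while layer i computes, layer i+1's
    weight can be pre-fetched via async broadcast from its owner."""
    return [(i, shard_owner(i, group_size)) for i in range(num_layers)]

# With 8 repeated layers sharded across K=4 devices, owners cycle
# 0,1,2,3,0,1,2,3, so each device stores only 1/4 of these weights.
print(broadcast_schedule(8, 4))
```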
@@ -23,13 +23,13 @@ This approach **preserves exact computational semantics** while **significantly


> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes—enabling **zero-overhead** weight loading.
> **Figure.** Layer Sharding Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes, enabling **zero-overhead** weight loading.

---

## Getting Started

To enable **Layer Shard Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:
To enable **Layer Sharding Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:

```bash
--additional-config '{
@@ -4,7 +4,7 @@
As introduced in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715), this is a simple ACLGraph graph mode acceleration solution based on Fx graphs.

## Using npugraph_ex
## Using Npugraph_ex

Npugraph_ex will be enabled by default in the future. Take the Qwen series models as an example to show how to configure it.
@@ -8,7 +8,7 @@ Since the generation and training phases may employ different model parallelism
## Getting started

With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vllm is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vLLM is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.

The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
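The tag map makes the two levels easy to reason about. A toy Python sketch of the idea (class name and buffers are illustrative, not vLLM's real allocator; the level semantics follow vLLM's sleep mode, where level 1 offloads weights to host memory and discards the KV cache, while level 2 discards both):

```python
# Toy sketch of the tagged-pool idea behind sleep mode.
class ToySleepPool:
    def __init__(self):
        # Stand-ins for device buffers, tagged at allocation time.
        self.tagged = {"weight": [1.0] * 4, "kv_cache": [0.0] * 8}
        self.cpu_backup = {}

    def sleep(self, level: int):
        if level == 1:
            # Level 1: offload weights to host memory, discard the KV cache.
            self.cpu_backup["weight"] = self.tagged.pop("weight")
            self.tagged.pop("kv_cache")
        elif level == 2:
            # Level 2: discard both; weights must be reloaded on wake-up.
            self.tagged.clear()

    def wake_up(self):
        if "weight" in self.cpu_backup:
            self.tagged["weight"] = self.cpu_backup.pop("weight")

pool = ToySleepPool()
pool.sleep(level=1)
print(sorted(pool.tagged))  # device memory freed for both tags
pool.wake_up()
print(sorted(pool.tagged))  # weights restored from the CPU backup
```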
@@ -6,7 +6,7 @@ Since we use vector computations to hide the weight prefetching pipeline, this h
## Quick Start

With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to open weight prefetch.
Use `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to enable weight prefetch.

## Fine-tune Prefetch Ratio
@@ -71,10 +71,10 @@ vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
`HCCL_EXEC_TIMEOUT`, `HCCL_CONNECT_TIMEOUT`, and `HCCL_IF_IP` are HCCL-related configurations.<br>
Set `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`, and `HCCL_SOCKET_IFNAME` to the corresponding NIC.<br>
`ASCEND_RT_VISIBLE_DEVICES` specifies the cards on which the node runs. The total number of cards equals `dp_size*tp_size`.<br>
`/xxxxx/DeepSeek-V2-Lite-Chat` is configured as a model that requires run.<br>
`/xxxxx/DeepSeek-V2-Lite-Chat` is the model to be run.<br>
`--host`: indicates the IP address of the node to be started.<br>
`--port`: indicates the port to be started, which corresponds to the port in step 4.<br>
`--seed`, --max-model-len, and --max-num-batched-tokens model basic configuration. Set this parameter based on the site requirements.<br>
`--port`: indicates the port on which the prefill node will listen (e.g., 8100). This port is later referenced in step 3 when configuring the proxy server.<br>
`--seed`, `--max-model-len`, and `--max-num-batched-tokens` are part of the model's basic configuration. Set these parameters based on the site requirements.<br>
`--tensor-parallel-size`: specifies the TP size.<br>
`--data-parallel-size`: indicates the DP size.<br>
`--data-parallel-address`: indicates the IP address of the DP. Set this parameter to the IP address of the node.<br>`--data-parallel-rpc-port`: indicates the RPC port for communication in the DP group.<br>
@@ -144,7 +144,7 @@ python load_balance_proxy_server_example.py --host localhost --prefiller-hosts h
`--host`: indicates the active node. The value of `localhost` in the curl command delivered in step 5 must match this host. The default port number for starting the service proxy is 8000.<br>
`--prefiller-hosts`: Set this parameter to the IP addresses of all prefill (P) nodes. In the xPyD scenario, append the IP addresses to this configuration item, separated by spaces.<br>
`--prefiller-ports`: Set this parameter to the port number of all p nodes, which is the configuration of the port number for the vllm to start the service in step 3. Write the port number after the configuration in sequence and leave a blank space between the port number and the port number. The sequence must be one-to-one mapping to the IP address of --prefiller-hosts.<br>
`--prefiller-ports`: Set this parameter to the port numbers of all prefill (P) nodes, which were defined in step 1 when starting the prefill nodes. List the port numbers in order, separated by spaces; the sequence must map one-to-one to the IP addresses in `--prefiller-hosts`.<br>
`--decoder-hosts`: Set this parameter to the IP addresses of all decode (D) nodes. In the xPyD scenario, append the IP addresses to this configuration item, separated by spaces.<br>
`--decoder-ports`: Set this parameter to the port numbers of all decode (D) nodes, defined in step 4 when starting the decode nodes. List the port numbers in order, separated by spaces; the sequence must map one-to-one to the IP addresses in `--decoder-hosts`.<br>
@@ -11,7 +11,7 @@
## run

The EPD disaggregated technology accelerates model inference by decoupling the visual encoding computation and LLM computation stages. Currently, the EPD separation feature can achieve different data transmissions between E and P/PD nodes by configuring different connector backends. Vllm-ascend currently supports the ECexample-connector backend implemented on vllm, and will support Mooncake as well as shared memory(SHM) backend transmission methods in the future.
The EPD disaggregated technology accelerates model inference by decoupling the visual encoding computation and LLM computation stages. Currently, the EPD separation feature can achieve different data transmissions between E and P/PD nodes by configuring different connector backends. vLLM-Ascend currently supports the ECExampleConnector backend implemented on vLLM, and will support Mooncake as well as shared memory (SHM) backend transmission methods in the future.

### ECexample-connector deployment guide
@@ -114,7 +114,7 @@ python3 epd_load_balance_proxy_layerwise_server_example.py \
    --port 8001
```

TODO: explain the param.<br>
The parameters are explained as follows:<br>
`--encoder-hosts`: E node IP address.<br>
`--encoder-ports`: The E node port number. It needs to be consistent with the `--port` in the E node's startup script.<br>
`--pd-hosts`: PD node IP address.<br>
@@ -1,8 +1,8 @@
Here is an example guiding how to use `launch_online_dp.py` to launch external dp vllm servers. User can easily launch external dp servers following the steps below:
Here is an example showing how to use `launch_online_dp.py` to launch external dp vLLM servers. Users can easily launch external dp servers following the steps below:

### Modify parameters in `run_dp_template.sh`

`run_dp_template.sh` is an template script used to launch each dp vllm instance separately. It will be called by `launch_online_dp.py` in multi threads and most of its configurations are set by `launch_online_dp.py`. Parameters you need to set manually include:
`run_dp_template.sh` is a template script used to launch each data parallel (dp) vLLM instance separately. It will be called by `launch_online_dp.py` in multiple threads, and most of its configurations are set by `launch_online_dp.py`. Parameters you need to set manually include:

1. The IP and socket_ifname of your machine. If running on multi-nodes, please make sure the scripts on each node have been set with the correct IP and socket_ifname of that node.
2. vLLM serving related parameters including model_path and other configurations. Note that the port, dp-related parameters, and tp_size are set by `launch_online_dp.py`; all the other vLLM parameters in this file only serve as an example and you are free to modify them according to your purpose.
@@ -42,7 +42,7 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
### (Optional) Run `dp_load_balance_proxy_server.py` to load balance requests between external dp servers

External dp server means that you need to handle load balance between multiple dp instances out of vllm by implementing your custom proxy server. Here we provide an example of request-length-aware dp load-balance proxy server for you. The arguments of `dp_load_balance_proxy_server.py` include:
External dp server means that you need to handle load balancing between multiple dp instances outside of vLLM by implementing your custom proxy server. Here we provide an example of a request-length-aware dp load-balance proxy server. The arguments of `dp_load_balance_proxy_server.py` include:

1. `--port`: port of the proxy server, default 8000
2. `--host`: host address of the proxy server, default localhost
@@ -30,5 +30,5 @@ lm_eval --model {{ model_type }} --model_args $MODEL_ARGS \
| Task | Metric | Value | Stderr |
|-----------------------|-------------|----------:|-------:|
{% for row in rows -%}
| {{ row.task }} | {{ row.metric }} | {{ row.value }} | ± {{ "%.4f" | format(row.stderr | float) }} |
| {{ row.task }} | {{ row.metric }} | {{ row.value }} | ± {{ "%.4f" \| format(row.stderr \| float) }} |
{% endfor %}
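For reference, Jinja's `float` and `format` filters in the stderr cell mirror Python's builtin `float()` and %-style formatting, so the cell's value can be checked with plain Python (the sample inputs below are made up):

```python
# What the stderr cell computes: {{ "%.4f" | format(row.stderr | float) }}
# Jinja's `float` filter behaves like Python's float(); its `format` filter
# applies %-style formatting, so the chain is equivalent to:
def stderr_cell(stderr) -> str:
    return "%.4f" % float(stderr)

print(stderr_cell("0.5"))    # 0.5000
print(stderr_cell(0.0123))   # 0.0123
```

The backslash-escaped `\|` in the second row only protects the markdown table rendering of this page; the template logic is unchanged.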