[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)

### What this PR does / why we need it?
Correcting the document and uploading the model deployment template

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Author: herizhen
Date: 2026-04-15 16:03:11 +08:00
Committed by: GitHub
Parent: 147b589f62
Commit: 95726d20eb
31 changed files with 536 additions and 308 deletions


@@ -10,13 +10,13 @@ Hence, KV Cache Pool is proposed to utilize various types of storage including H
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
-While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware.
+While one can utilize MooncakeStore in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports MooncakeStore and can utilize the data transfer strategy that best fits Huawei NPU hardware.
-Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
+Hence, we propose to integrate MooncakeStore with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
## Usage
-vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
+vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable MooncakeStore, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
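For orientation, here is a minimal Python sketch of what such a configuration could look like, assuming vLLM's `KVTransferConfig` interface. The model name and role are placeholders, and a real deployment needs the additional MooncakeStore settings (metadata server, storage backends, etc.) described in the user guide:

```python
# A minimal sketch, not a verified deployment: the connector name comes from
# the doc above; model and role are placeholders for illustration.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="MooncakeStoreConnector",
        kv_role="kv_both",  # assumed role for a node that both puts and gets
    ),
)
```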
@@ -51,7 +51,7 @@ Currently, we only perform put and get operations of KV Pool for **Prefill Nodes
The key benefit of this design is that Prefill Nodes keep the performance gain of computing less thanks to Prefix Caching from HBM and the KV Pool, while the data transfer efficiency between Prefill and Decode nodes is not sacrificed, since the P2P KV Connector still transfers KV Caches directly between NPU devices.
-To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
+To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Store Connector Deployment Guide.
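As a rough illustration of the Multi Connector setup, the sketch below follows vLLM's `KVTransferConfig` conventions; the connector ordering and roles here are assumptions, not the verified configuration from the deployment guide:

```python
# A hedged sketch of a Multi Connector configuration; take the exact values
# for a Mooncake deployment from the deployment guide.
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(
    kv_connector="MultiConnector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "connectors": [
            # P2P connector for direct NPU-to-NPU transfer between P and D
            {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
            # pool connector for KV Cache put/get against MooncakeStore
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
        ]
    },
)
```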
@@ -86,6 +86,6 @@ The KV Connector methods that need to be implemented can be categorized into sch
## Limitations
-1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
+1. Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
2. For now, if we successfully look up a key and find it exists, but fail to get it when calling the KV Pool's get function, we just output a log indicating the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-computing everything as if there were no prefix cache hit (or, even better, reverting only one block and keeping the Prefix Caches before that).
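In code, the failure path described above looks roughly like the following sketch; `store.exists`, `store.get`, and the surrounding names are hypothetical stand-ins, not the actual connector implementation:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_kv_block(store, key):
    """Hypothetical sketch of the lookup-succeeds-but-get-fails path."""
    if store.exists(key):      # lookup says the key is present...
        value = store.get(key)
        if value is None:      # ...but the get itself can still fail
            # current behavior: log and continue, which may affect accuracy;
            # planned: fall back and recompute as if there were no cache hit
            logger.warning("KV Pool get failed for key %s; continuing", key)
        return value
    return None
```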


@@ -35,7 +35,7 @@ The workflow of obtaining inputs:
Finally, these `Token IDs` need to be fed into the model, and `positions` must also be passed in to create `RoPE` (rotary positional embedding). Both of them are inputs of the model.
-**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
+**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
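A small illustration of the two inputs (shapes and values are hypothetical, not the actual model-runner code):

```python
import torch

# Hypothetical example: four tokens of one request, flattened for the batch.
input_ids = torch.tensor([1, 2023, 2003, 4067])  # Token IDs, a.k.a. Input IDs
positions = torch.arange(input_ids.shape[0])     # per-token positions, used by RoPE

# Both tensors are then passed to the model's forward, e.g. (sketch):
# hidden_states = model(input_ids=input_ids, positions=positions)
```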
### 2. Build inputs attention metadata


@@ -60,7 +60,7 @@ Before writing a patch, following the principle above, we should patch the least
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
-# How
+# How:
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
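For illustration, here is a hypothetical filled-in instance of this template; the module, reason, and PR link are invented placeholders, not an actual patch:

```python
# 1. `vllm.distributed.parallel_state`
#   Why:
#     Hypothetical example: vLLM assumes CUDA devices when initializing
#     process groups, which does not hold on Ascend NPUs.
#   How:
#     Patch the device detection with NPU-aware logic.
#   Related PR (if no, explain why):
#     https://github.com/vllm-project/vllm/pull/<id>  (placeholder link)
```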


@@ -54,7 +54,7 @@ Based on the above content, we present a brief description of the adaptation pro
- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
-from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme
+from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme, AscendMoEScheme
@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
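    ...

# The newly imported AscendMoEScheme suggests the same pattern applies to MoE
# layers. A hedged sketch follows: the scheme-type string "moe" and the class
# name below are assumptions for illustration, not the actual registry contents.
@register_scheme("W4A8_DYNAMIC", "moe")
class AscendW4A8DynamicMoEMethod(AscendMoEScheme):
    ...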


@@ -55,14 +55,14 @@ From the workflow perspective, we can see how the final test script is executed,
deployment:
  -
    envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
  -
    envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
benchmarks:
perf:
# fill with performance test kwargs
@@ -76,34 +76,34 @@ Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml]
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
@@ -179,14 +179,14 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
            name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
        - name: shared-volume
          persistentVolumeClaim:
            claimName: nv-action-vllm-benchmarks-v2
        - name: driver-tools
          hostPath:
            path: /usr/local/Ascend/driver/tools
workerTemplate:
spec:
@@ -224,14 +224,14 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
            name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
        - name: shared-volume
          persistentVolumeClaim:
            claimName: nv-action-vllm-benchmarks-v2
        - name: driver-tools
          hostPath:
            path: /usr/local/Ascend/driver/tools
---
apiVersion: v1


@@ -332,7 +332,6 @@ An L0 `dump.json` contains forward I/O for modules together with parameters. Usi
"data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
}
}
-        },
+        }
}
}
@@ -389,7 +388,6 @@ An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` function
"data_name": "Functional.relu.0.forward.output.0.pt"
}
]
-        },
+        }
}
}
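Both removed commas matter because `dump.json` must be strict JSON; a trailing comma makes the standard parser reject the whole file, as this small check illustrates:

```python
import json

# json.load raises json.JSONDecodeError on a dump that still contains a
# trailing comma, so the files above must be strict JSON.
with open("dump.json") as f:
    dump = json.load(f)
```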


@@ -111,7 +111,7 @@ sudo apt update
sudo apt install libjemalloc2
# Configure jemalloc
-export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
+export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
```
#### 2.2. Tcmalloc


@@ -97,7 +97,8 @@ For local `dataset-path`, please set `hf-name` to its Hugging Face ID like
First start serving your model:
```bash
-VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
+export VLLM_USE_MODELSCOPE=True
+vllm serve Qwen/Qwen3-8B
```
Then run the benchmarking script:
@@ -158,7 +159,7 @@ vllm bench throughput \
If successful, you will see output like the following:
```shell
-Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
+Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280