[Doc] Sensitive word modification (#8303)

### What this PR does / why we need it?
This PR updates the documentation to replace specific hardware terms
(e.g., HBM, 910B, 310P) with more generic or branded terms (e.g.,
on-chip memory, Atlas inference products) to comply with sensitive word
requirements.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Author: herizhen
Date: 2026-04-17 16:30:00 +08:00 (committed by GitHub)
Parent: 9c1d58f4d2
Commit: 76cc2204bd
11 changed files with 31 additions and 31 deletions

View File

@@ -4,9 +4,9 @@
Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
-However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses HBM for KV cache storage.
+However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses on-chip memory for KV cache storage.
-Hence, KV Cache Pool is proposed to utilize various types of storage including HBM, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
+Hence, KV Cache Pool is proposed to utilize various types of storage including on-chip memory, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
@@ -22,26 +22,26 @@ For step-by-step deployment and configuration, please refer to the [KV Pool User
## How it works?
-The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.
+The KV Cache Pool integrates multiple memory tiers (on-chip memory, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
-When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
+When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in on-chip memory) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
-### 1. Combining KV Cache Pool with HBM Prefix Caching
+### 1. Combining KV Cache Pool with on-chip memory Prefix Caching
-Prefix Caching with HBM is already supported by the vLLM V1 Engine.
+Prefix Caching with on-chip memory is already supported by the vLLM V1 Engine.
-By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
+By introducing KV Connector V1, users can seamlessly combine on-chip memory-based Prefix Caching with Mooncake-backed KV Pool.
The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV Connector for KV Pool (e.g., the MooncakeStoreConnector).
**Workflow**:
-1. The engine first checks for prefix hits in the HBM cache.
+1. The engine first checks for prefix hits in the on-chip memory cache.
-2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
+2. After getting the number of hit tokens on on-chip memory, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from on-chip memory to minimize the data transfer latency.
-3. After the KV Caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
+3. After the KV Caches in the KV Pool are loaded into on-chip memory, the remaining process is the same as Prefix Caching in on-chip memory.
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
@@ -49,7 +49,7 @@ When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cach
Currently, we only perform put and get operations of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e., MooncakeConnector.
-The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly.
+The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from on-chip memory and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly.
To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
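For orientation, a minimal sketch of what such a Multi Connector setup might look like on a prefill node, assuming vLLM's `--kv-transfer-config` JSON format and the connector names used in the text above (registered connector names and extra-config fields can differ between vLLM Ascend releases):

```shell
# Hedged sketch only: a prefill-node launch chaining the P2P MooncakeConnector
# and the MooncakeStoreConnector through vLLM's MultiConnector. Model name,
# roles, and connector spellings are illustrative assumptions.
vllm serve deepseek-ai/DeepSeek-R1 \
  --kv-transfer-config '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_producer",
    "kv_connector_extra_config": {
      "connectors": [
        {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
        {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}
      ]
    }
  }'
```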

View File

@@ -36,7 +36,7 @@ On multi-socket ARM systems, the OS scheduler may place vLLM threads on CPUs f
- Read cpuset from /proc/self/status.
- Read topo affinity from `npu-smi info -t topo`.
4. **Build CPU pools**:
-- Use **global_slice** for A3 devices; **topo_affinity** for A2 and 310P.
+- Use **global_slice** for A3 devices; **topo_affinity** for A2 and Atlas 300 inference products.
- If topo affinity is missing, fall back to global_slice.
- Ensure each NPU has at least 5 CPUs.
5. **Allocate per-role CPUs**:
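As a quick way to inspect the two affinity inputs listed above on a node, a hedged shell sketch (the exact `npu-smi` output format depends on the driver version):

```shell
# Show which CPUs the current process may run on (the cpuset input).
grep Cpus_allowed_list /proc/self/status

# Show the NPU-to-CPU/NUMA affinity table (the topo affinity input).
npu-smi info -t topo
```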

View File

@@ -136,9 +136,9 @@ The problem is usually caused by the installation of a development or editable v
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory).
-In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
+In scenarios where NPUs have limited high bandwidth memory (on-chip memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
-- **Limit `--max-model-len`**: It can save the HBM usage for KV cache initialization step.
+- **Limit `--max-model-len`**: It can save the on-chip memory usage for KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization).
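A hedged example applying both mitigations together; the model name and the concrete values are placeholders, not recommendations:

```shell
# Cap the sequence length to shrink the KV cache allocated at startup, and
# lower memory utilization to keep headroom against fragmentation.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```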

View File

@@ -283,8 +283,8 @@ Send Dataset A to Instance 1 on Node 1 and record the Time to First Token
### Preparation for Step 2
Before Step 2, send a fully random Dataset B to Instance 1. Due to the
-unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
+unified on-chip memory/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
-Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
+Dataset B's cache evicts Dataset A's cache from on-chip memory, leaving Dataset A's
cache only in Node 1's DRAM.
### Step 2: Local DRAM Hit
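One hedged way to produce such a fully random Dataset B is vLLM's serving benchmark with its random dataset; the host, port, model, and lengths below are placeholders, and the flag names reflect recent vLLM releases:

```shell
# Flood Instance 1 with random prompts so its on-chip memory cache is evicted,
# leaving Dataset A's KV cache only in DRAM.
vllm bench serve \
  --host 127.0.0.1 --port 8000 \
  --model deepseek-ai/DeepSeek-R1 \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 1 \
  --num-prompts 512
```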

View File

@@ -48,7 +48,7 @@ docker run --rm \
```
:::{note}
-The 310P device is supported from version 0.15.0rc1. You need to select the corresponding image for installation.
+The Atlas 300 inference products are supported from version 0.15.0rc1. You need to select the corresponding image for installation.
:::
## Deployment
@@ -57,7 +57,7 @@ The 310P device is supported from version 0.15.0rc1. You need to select the corr
#### Single NPU (PaddleOCR-VL)
-PaddleOCR-VL supports single-node single-card deployment on the 910B4 and 310P platform. Follow these steps to start the inference service:
+PaddleOCR-VL supports single-node single-card deployment on the 910B4 and Atlas 300 inference products platform. Follow these steps to start the inference service:
1. Prepare model weights: Ensure the downloaded model weights are stored in the `PaddleOCR-VL` directory.
2. Create and execute the deployment script (save as `deploy.sh`):
@@ -90,10 +90,10 @@ vllm serve ${MODEL_PATH} \
```
::::
-::::{tab-item} 310P
+::::{tab-item} Atlas 300 inference products
-:sync: 310P
+:sync: Atlas 300 inference products
-Run the following script to start the vLLM server on single 310P:
+Run the following script to start the vLLM server on single Atlas 300 inference products:
```shell
#!/bin/sh
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
```
:::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the 310P device.
+The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
:::
::::
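For reference, a hedged sketch of a single-card `deploy.sh` in the spirit of the tab above; the device index, model path, and length value are illustrative, and only the role of `--max_model_len` comes from the note:

```shell
#!/bin/sh
# Illustrative deploy.sh sketch: serve PaddleOCR-VL on a single card and cap
# the sequence length so the attention-mask generation noted above does not fail.
export ASCEND_RT_VISIBLE_DEVICES=0
MODEL_PATH=./PaddleOCR-VL

vllm serve ${MODEL_PATH} \
  --max_model_len 4096
```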
@@ -260,7 +260,7 @@ The 910B4 device supports inference using the PaddlePaddle framework.
::::{tab-item} OM inference
:sync: om
-The 310P device supports only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
+The Atlas 300 inference products support only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
::::
:::::

View File

@@ -328,7 +328,7 @@ vllm serve Qwen/Qwen3-VL-8B-Instruct \
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:
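A hedged illustration of the flag the note above refers to; the length value is a placeholder to be sized against the KV-cache capacity of your NPU's on-chip memory:

```shell
# Illustrative only: keep the model's max sequence length within what the
# KV cache can actually hold on this device.
vllm serve Qwen/Qwen3-VL-8B-Instruct \
  --max_model_len 32768
```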
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
```
:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series.
:::
If your service start successfully, you can see the info shown below:

View File

@@ -90,7 +90,7 @@ vllm serve deepseek-ai/DeepSeek-R1 \
## Experimental Results
-To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated decode instances in an environment of 32 cards Ascend 910B*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8; the performance data is as follows.
+To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated decode instances in an environment of 32 cards Ascend Atlas A2 inference products*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8; the performance data is as follows.
| Module | Memory Savings | TPOT Impact (batch=24) |
| ---------------- | -------------- | ------------------------- |

View File

@@ -18,12 +18,12 @@ Batch invariance is crucial for several use cases:
## Hardware Requirements
-Batch invariance currently requires Ascend 910B NPUs, because only the 910B supports batch invariance with HCCL communication for now.
+Batch invariance currently requires Ascend Atlas A2 inference products NPUs, because only the Atlas A2 inference products supports batch invariance with HCCL communication for now.
We will support other NPUs in the future.
## Software Requirements
-Batch invariance requires a custom operator library for 910B.
+Batch invariance requires a custom operator library for Atlas A2 inference products.
We will release the customed operator library in future versions.
## Enabling Batch Invariance

View File

@@ -79,7 +79,7 @@ The total world size is `tensor_parallel_size` * `prefill_context_parallel_size`
## Experimental Results
-To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD disaggregate instances in the environment of 64 cards Ascend 910C*64G (A3), the configuration and performance data are as follows.
+To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD disaggregate instances in the environment of 64 cards Ascend Atlas A3 inference products*64G (A3), the configuration and performance data are as follows.
- DeepSeek-R1-W8A8:

View File

@@ -93,5 +93,5 @@ After startup, you can test consistency by issuing inference requests with tempe
## Note & Caveats
- If Netloader is used, **each worker process** must bind a listening port. That port may be user-specified or assigned randomly. If user-specified, ensure it is available.
-- Netloader requires extra HBM memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`).
+- Netloader requires extra on-chip memory memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`).
- It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or slow connections/transmissions. Related info: [vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226).
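A hedged sketch combining the two recommendations above; the buffer size mirrors the default mentioned in the note, while the model and utilization value are placeholders:

```shell
# Reserve headroom for Netloader's HCCL connections and avoid idle busy-waiting.
export HCCL_BUFFERSIZE=200   # MB per HCCL connection; matches the noted default
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve deepseek-ai/DeepSeek-R1 \
  --gpu-memory-utilization 0.85
```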

View File

@@ -2,7 +2,7 @@
## Overview
-Unified Cache Management (UCM) provides an external KV-cache storage layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike KV Pooling, which expands prefix-cache capacity only by aggregating device memory and therefore remains limited by HBM/DRAM size and lacks persistence, UCM decouples compute from storage and adopts a tiered design. Each node uses local DRAM as a fast cache, while a shared backend—such as 3FS or enterprise-grade storage—serves as the persistent KV store. This approach removes the capacity ceiling imposed by device memory, enables durable and reliable prefix caching, and allows cache capacity to scale with the storage system rather than with compute resources.
+Unified Cache Management (UCM) provides an external KV-cache storage layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike KV Pooling, which expands prefix-cache capacity only by aggregating device memory and therefore remains limited by on-chip memory/DRAM size and lacks persistence, UCM decouples compute from storage and adopts a tiered design. Each node uses local DRAM as a fast cache, while a shared backend—such as 3FS or enterprise-grade storage—serves as the persistent KV store. This approach removes the capacity ceiling imposed by device memory, enables durable and reliable prefix caching, and allows cache capacity to scale with the storage system rather than with compute resources.
## Prerequisites