From 76cc2204bdda578803b761d0740356e331658230 Mon Sep 17 00:00:00 2001 From: herizhen <59841270+herizhen@users.noreply.github.com> Date: Fri, 17 Apr 2026 16:30:00 +0800 Subject: [PATCH] [Doc] Sensitive word modification (#8303) ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> --- .../Design_Documents/KV_Cache_Pool_Guide.md | 22 +++++++++---------- .../Design_Documents/cpu_binding.md | 2 +- docs/source/faqs.md | 4 ++-- .../pd_colocated_mooncake_multi_instance.md | 4 ++-- docs/source/tutorials/models/PaddleOCR-VL.md | 14 ++++++------ docs/source/tutorials/models/Qwen-VL-Dense.md | 4 ++-- .../feature_guide/Fine_grained_TP.md | 2 +- .../feature_guide/batch_invariance.md | 4 ++-- .../feature_guide/context_parallel.md | 2 +- .../user_guide/feature_guide/netloader.md | 2 +- .../feature_guide/ucm_deployment.md | 2 +- 11 files changed, 31 insertions(+), 31 deletions(-) diff --git a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md index 45e292cb..54c70898 100644 --- a/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md +++ b/docs/source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md @@ -4,9 +4,9 @@ Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically. -However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses HBM for KV cache storage. 
+However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses on-chip memory for KV cache storage. -Hence, KV Cache Pool is proposed to utilize various types of storage including HBM, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests. +Hence, KV Cache Pool is proposed to utilize various types of storage including on-chip memory, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests. vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines. @@ -22,26 +22,26 @@ For step-by-step deployment and configuration, please refer to the [KV Pool User ## How it works? -The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture. +The KV Cache Pool integrates multiple memory tiers (on-chip memory, DRAM, SSD, etc.) through a connector-based architecture. Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth. -When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory. +When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in on-chip memory) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory. -### 1. Combining KV Cache Pool with HBM Prefix Caching +### 1. 
Combining KV Cache Pool with on-chip memory Prefix Caching

-Prefix Caching with HBM is already supported by the vLLM V1 Engine.
-By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
+Prefix Caching with on-chip memory is already supported by the vLLM V1 Engine.
+By introducing KV Connector V1, users can seamlessly combine on-chip memory-based Prefix Caching with Mooncake-backed KV Pool.

The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV Connector for KV Pool (e.g., the MooncakeStoreConnector).

**Workflow**:

-1. The engine first checks for prefix hits in the HBM cache.
+1. The engine first checks for prefix hits in the on-chip memory cache.

-2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from HBM to minimize the data transfer latency.
+2. After getting the number of hit tokens in on-chip memory, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from on-chip memory to minimize the data transfer latency.

-3. After the KV Caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
+3. After the KV Caches in the KV Pool are loaded into on-chip memory, the remaining process is the same as Prefix Caching in on-chip memory.

### 2. 
Combining KV Cache Pool with Mooncake PD Disaggregation @@ -49,7 +49,7 @@ When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cach Currently, we only perform put and get operations of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e., MooncakeConnector. -The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from HBM and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly. +The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from on-chip memory and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly. To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order. diff --git a/docs/source/developer_guide/Design_Documents/cpu_binding.md b/docs/source/developer_guide/Design_Documents/cpu_binding.md index 5ea24e41..4fdff3e7 100644 --- a/docs/source/developer_guide/Design_Documents/cpu_binding.md +++ b/docs/source/developer_guide/Design_Documents/cpu_binding.md @@ -36,7 +36,7 @@ On multi‑socket ARM systems, the OS scheduler may place vLLM threads on CPUs f - Read cpuset from /proc/self/status. - Read topo affinity from `npu‑smi info -t topo`. 4. **Build CPU pools**: - - Use **global_slice** for A3 devices; **topo_affinity** for A2 and 310P. + - Use **global_slice** for A3 devices; **topo_affinity** for A2 and Atlas 300 inference products. - If topo affinity is missing, fall back to global_slice. - Ensure each NPU has at least 5 CPUs. 5. 
**Allocate per‑role CPUs**:
diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index 97310ec5..8ba12db5 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -136,9 +136,9 @@ The problem is usually caused by the installation of a development or editable v
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory).

-In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
+In scenarios where NPUs have limited on-chip memory capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:

-- **Limit `--max-model-len`**: It can save the HBM usage for KV cache initialization step.
+- **Limit `--max-model-len`**: It can reduce on-chip memory usage during the KV cache initialization step.

- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization).
diff --git a/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md b/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
index 9cffcfe2..15354a5a 100644
--- a/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
+++ b/docs/source/tutorials/features/pd_colocated_mooncake_multi_instance.md
@@ -283,8 +283,8 @@ Send Dataset A to Instance 1 on Node 1 and record the Time to First Token

### Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. 
Due to the
-unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
-Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's
+unified on-chip memory/DRAM KV Cache with LRU (Least Recently Used) eviction policy,
+Dataset B's cache evicts Dataset A's cache from on-chip memory, leaving Dataset A's
cache only in Node 1's DRAM.

### Step 2: Local DRAM Hit
diff --git a/docs/source/tutorials/models/PaddleOCR-VL.md b/docs/source/tutorials/models/PaddleOCR-VL.md
index 8b9f1c54..c44dad6c 100644
--- a/docs/source/tutorials/models/PaddleOCR-VL.md
+++ b/docs/source/tutorials/models/PaddleOCR-VL.md
@@ -48,7 +48,7 @@ docker run --rm \
```

:::{note}
-The 310P device is supported from version 0.15.0rc1. You need to select the corresponding image for installation.
+The Atlas 300 inference products are supported from version 0.15.0rc1. You need to select the corresponding image for installation.
:::

## Deployment
@@ -57,7 +57,7 @@ The 310P device is supported from the corr

#### Single NPU (PaddleOCR-VL)

-PaddleOCR-VL supports single-node single-card deployment on the 910B4 and 310P platform. Follow these steps to start the inference service:
+PaddleOCR-VL supports single-node single-card deployment on the 910B4 and Atlas 300 inference products platforms. Follow these steps to start the inference service:

1. Prepare model weights: Ensure the downloaded model weights are stored in the `PaddleOCR-VL` directory.
2. 
Create and execute the deployment script (save as `deploy.sh`):
@@ -90,10 +90,10 @@ vllm serve ${MODEL_PATH} \
```
::::

-::::{tab-item} 310P
-:sync: 310P
+::::{tab-item} Atlas 300 inference products
+:sync: Atlas 300 inference products

-Run the following script to start the vLLM server on single 310P:
+Run the following script to start the vLLM server on a single Atlas 300 inference products device:

```shell
#!/bin/sh
@@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \
```

:::{note}
-The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the 310P device.
+The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products.
:::
::::

@@ -260,7 +260,7 @@ The 910B4 device supports inference using the PaddlePaddle framework.
::::{tab-item} OM inference
:sync: om

-The 310P device supports only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
+The Atlas 300 inference products support only the OM model inference. For details about the process, see the guide provided in [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2).
::::
:::::
diff --git a/docs/source/tutorials/models/Qwen-VL-Dense.md b/docs/source/tutorials/models/Qwen-VL-Dense.md
index 78aae7de..b42b1d15 100644
--- a/docs/source/tutorials/models/Qwen-VL-Dense.md
+++ b/docs/source/tutorials/models/Qwen-VL-Dense.md
@@ -328,7 +328,7 @@ vllm serve Qwen/Qwen3-VL-8B-Instruct \
```

:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the HBM size. Please modify the value according to a suitable value for your NPU series. 
+Add `--max_model_len` option to avoid a ValueError when the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in the KV cache. This limit differs across NPU series based on the on-chip memory size. Please choose a value suitable for your NPU series.
:::

If your service start successfully, you can see the info shown below:
@@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
```

:::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid a ValueError when the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in the KV cache. This limit differs across NPU series based on the on-chip memory size. Please choose a value suitable for your NPU series.
:::

If your service start successfully, you can see the info shown below:
diff --git a/docs/source/user_guide/feature_guide/Fine_grained_TP.md b/docs/source/user_guide/feature_guide/Fine_grained_TP.md
index 0a1ca68a..bdc4c696 100644
--- a/docs/source/user_guide/feature_guide/Fine_grained_TP.md
+++ b/docs/source/user_guide/feature_guide/Fine_grained_TP.md
@@ -90,7 +90,7 @@ vllm serve deepseek-ai/DeepSeek-R1 \
## Experimental Results

-To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated decode instances in an environment of 32 cards Ascend 910B*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8; the performance data is as follows. 
+To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated decode instances in an environment of 32 cards Ascend Atlas A2 inference products*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8; the performance data is as follows.

| Module | Memory Savings | TPOT Impact (batch=24) |
| ---------------- | -------------- | ------------------------- |
diff --git a/docs/source/user_guide/feature_guide/batch_invariance.md b/docs/source/user_guide/feature_guide/batch_invariance.md
index 9d6c0efe..2eb1ee56 100644
--- a/docs/source/user_guide/feature_guide/batch_invariance.md
+++ b/docs/source/user_guide/feature_guide/batch_invariance.md
@@ -18,12 +18,12 @@ Batch invariance is crucial for several use cases:
## Hardware Requirements

-Batch invariance currently requires Ascend 910B NPUs, because only the 910B supports batch invariance with HCCL communication for now.
+Batch invariance currently requires Ascend Atlas A2 inference products NPUs, because currently only Atlas A2 inference products support batch invariance with HCCL communication.
We will support other NPUs in the future.

## Software Requirements

-Batch invariance requires a custom operator library for 910B.
+Batch invariance requires a custom operator library for Atlas A2 inference products.
We will release the customed operator library in future versions. 
## Enabling Batch Invariance
diff --git a/docs/source/user_guide/feature_guide/context_parallel.md b/docs/source/user_guide/feature_guide/context_parallel.md
index ad1c320b..abbda2af 100644
--- a/docs/source/user_guide/feature_guide/context_parallel.md
+++ b/docs/source/user_guide/feature_guide/context_parallel.md
@@ -79,7 +79,7 @@ The total world size is `tensor_parallel_size` * `prefill_context_parallel_size`
## Experimental Results

-To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD disaggregate instances in the environment of 64 cards Ascend 910C*64G (A3), the configuration and performance data are as follows.
+To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploy PD disaggregate instances in the environment of 64 cards Ascend Atlas A3 inference products*64G (A3), the configuration and performance data are as follows.

- DeepSeek-R1-W8A8:
diff --git a/docs/source/user_guide/feature_guide/netloader.md b/docs/source/user_guide/feature_guide/netloader.md
index 3bb0384d..0a629153 100644
--- a/docs/source/user_guide/feature_guide/netloader.md
+++ b/docs/source/user_guide/feature_guide/netloader.md
@@ -93,5 +93,5 @@ After startup, you can test consistency by issuing inference requests with tempe
## Note & Caveats

- If Netloader is used, **each worker process** must bind a listening port. That port may be user-specified or assigned randomly. If user-specified, ensure it is available.
-- Netloader requires extra HBM memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`).
+- Netloader requires extra on-chip memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`). 
- It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or slow connections/transmissions. Related info: [vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226). diff --git a/docs/source/user_guide/feature_guide/ucm_deployment.md b/docs/source/user_guide/feature_guide/ucm_deployment.md index 66d28a49..ad23fe5b 100644 --- a/docs/source/user_guide/feature_guide/ucm_deployment.md +++ b/docs/source/user_guide/feature_guide/ucm_deployment.md @@ -2,7 +2,7 @@ ## Overview -Unified Cache Management (UCM) provides an external KV-cache storage layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike KV Pooling, which expands prefix-cache capacity only by aggregating device memory and therefore remains limited by HBM/DRAM size and lacks persistence, UCM decouples compute from storage and adopts a tiered design. Each node uses local DRAM as a fast cache, while a shared backend—such as 3FS or enterprise-grade storage—serves as the persistent KV store. This approach removes the capacity ceiling imposed by device memory, enables durable and reliable prefix caching, and allows cache capacity to scale with the storage system rather than with compute resources. +Unified Cache Management (UCM) provides an external KV-cache storage layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike KV Pooling, which expands prefix-cache capacity only by aggregating device memory and therefore remains limited by on-chip memory/DRAM size and lacks persistence, UCM decouples compute from storage and adopts a tiered design. Each node uses local DRAM as a fast cache, while a shared backend—such as 3FS or enterprise-grade storage—serves as the persistent KV store. This approach removes the capacity ceiling imposed by device memory, enables durable and reliable prefix caching, and allows cache capacity to scale with the storage system rather than with compute resources. 
## Prerequisites
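
For reviewers auditing similar terminology sweeps across the docs tree, the substitutions in this patch can be approximated with a small script. The following is a minimal sketch, not part of the patch itself: the `REPLACEMENTS` map mirrors the substitutions visible in the diff, while the `scrub` helper and the script as a whole are hypothetical illustrations.

```python
import re

# Illustrative term map mirroring the substitutions made in this patch.
# Word boundaries (\b) keep compound identifiers such as "910B4" untouched,
# consistent with the diff, which leaves "910B4" as-is.
REPLACEMENTS = [
    (r"\bHBM\b", "on-chip memory"),
    (r"\b310P\b", "Atlas 300 inference products"),
    (r"\b910B\b", "Atlas A2 inference products"),
    (r"\b910C\b", "Atlas A3 inference products"),
]

def scrub(text: str) -> str:
    """Apply each word-boundary substitution in order and return the result."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text

if __name__ == "__main__":
    print(scrub("KV cache stored in HBM on the 310P device"))
```

A scripted pass like this only proposes candidates; as this PR shows, several sites (e.g. tab-item labels and their `:sync:` keys, or subject-verb agreement around the new plural product names) still need manual review after the substitution.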