[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)

### What this PR does / why we need it?
Correcting the document and uploading the model deployment template

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Author: herizhen
Date: 2026-04-15 16:03:11 +08:00
Committed by: GitHub
Parent: 147b589f62
Commit: 95726d20eb
31 changed files with 536 additions and 308 deletions


@@ -10,13 +10,13 @@ Hence, KV Cache Pool is proposed to utilize various types of storage including H
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
-While one can utilize Mooncake Store in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports Mooncake Store and can utilize the data transfer strategy that best fits Huawei NPU hardware.
+While one can utilize MooncakeStore in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports MooncakeStore and can utilize the data transfer strategy that best fits Huawei NPU hardware.
-Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
+Hence, we propose to integrate MooncakeStore with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
## Usage
-vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
+vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable MooncakeStore, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
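For orientation, here is a minimal Python sketch of what such a configuration could look like, assuming vLLM's `KVTransferConfig` interface. The model name and role are placeholders, and a real deployment needs the additional MooncakeStore settings (metadata server, storage backends, etc.) described in the user guide:

```python
# A minimal sketch, not a verified deployment: the connector name comes from
# the doc above; model and role are placeholders for illustration.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="MooncakeStoreConnector",
        kv_role="kv_both",  # assumed role for a node that both puts and gets
    ),
)
```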
@@ -51,7 +51,7 @@ Currently, we only perform put and get operations of KV Pool for **Prefill Nodes
The key benefit of this design is that Prefill Nodes keep the performance gain of computing less thanks to Prefix Caching from HBM and the KV Pool, while the data transfer efficiency between Prefill and Decode nodes is not sacrificed, since the P2P KV Connector still transfers KV Caches directly between NPU devices.
-To enable this feature, we need to set up both Mooncake Connector and Mooncake Store Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
+To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Store Connector Deployment Guide.
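As a rough illustration of the Multi Connector setup, the sketch below follows vLLM's `KVTransferConfig` conventions; the connector ordering and roles here are assumptions, not the verified configuration from the deployment guide:

```python
# A hedged sketch of a Multi Connector configuration; take the exact values
# for a Mooncake deployment from the deployment guide.
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(
    kv_connector="MultiConnector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "connectors": [
            # P2P connector for direct NPU-to-NPU transfer between P and D
            {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
            # pool connector for KV Cache put/get against MooncakeStore
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
        ]
    },
)
```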
@@ -86,6 +86,6 @@ The KV Connector methods that need to be implemented can be categorized into sch
## Limitations
-1. Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
+1. Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
2. For now, if we successfully look up a key and find it exists, but fail to get it when calling the KV Pool's get function, we just output a log indicating the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-computing everything as if there were no prefix cache hit (or, even better, reverting only one block and keeping the Prefix Caches before that).
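In code, the failure path described above looks roughly like the following sketch; `store.exists`, `store.get`, and the surrounding names are hypothetical stand-ins, not the actual connector implementation:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_kv_block(store, key):
    """Hypothetical sketch of the lookup-succeeds-but-get-fails path."""
    if store.exists(key):      # lookup says the key is present...
        value = store.get(key)
        if value is None:      # ...but the get itself can still fail
            # current behavior: log and continue, which may affect accuracy;
            # planned: fall back and recompute as if there were no cache hit
            logger.warning("KV Pool get failed for key %s; continuing", key)
        return value
    return None
```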


@@ -35,7 +35,7 @@ The workflow of obtaining inputs:
Finally, these `Token IDs` need to be fed into the model, and `positions` must also be passed in to create `RoPE` (rotary positional embedding). Both of them are inputs of the model.
-**Note**: The `Token IDs` are the inputs of a model, so we also call them `Inputs IDs`.
+**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
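A small illustration of the two inputs (shapes and values are hypothetical, not the actual model-runner code):

```python
import torch

# Hypothetical example: four tokens of one request, flattened for the batch.
input_ids = torch.tensor([1, 2023, 2003, 4067])  # Token IDs, a.k.a. Input IDs
positions = torch.arange(input_ids.shape[0])     # per-token positions, used by RoPE

# Both tensors are then passed to the model's forward, e.g. (sketch):
# hidden_states = model(input_ids=input_ids, positions=positions)
```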
### 2. Build inputs attention metadata


@@ -60,7 +60,7 @@ Before writing a patch, following the principle above, we should patch the least
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
-# How
+# How:
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
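For illustration, here is a hypothetical filled-in instance of this template; the module, reason, and PR link are invented placeholders, not an actual patch:

```python
# 1. `vllm.distributed.parallel_state`
#   Why:
#     Hypothetical example: vLLM assumes CUDA devices when initializing
#     process groups, which does not hold on Ascend NPUs.
#   How:
#     Patch the device detection with NPU-aware logic.
#   Related PR (if no, explain why):
#     https://github.com/vllm-project/vllm/pull/<id>  (placeholder link)
```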


@@ -54,7 +54,7 @@ Based on the above content, we present a brief description of the adaptation pro
- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
-from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme
+from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme, AscendMoEScheme
@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
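    ...

# The newly imported AscendMoEScheme suggests the same pattern applies to MoE
# layers. A hedged sketch follows: the scheme-type string "moe" and the class
# name below are assumptions for illustration, not the actual registry contents.
@register_scheme("W4A8_DYNAMIC", "moe")
class AscendW4A8DynamicMoEMethod(AscendMoEScheme):
    ...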


@@ -55,14 +55,14 @@ From the workflow perspective, we can see how the final test script is executed,
deployment:
  -
    envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
  -
    envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
benchmarks:
perf:
# fill with performance test kwargs
@@ -76,34 +76,34 @@ Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml]
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
@@ -179,14 +179,14 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
            name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
        - name: shared-volume
          persistentVolumeClaim:
            claimName: nv-action-vllm-benchmarks-v2
        - name: driver-tools
          hostPath:
            path: /usr/local/Ascend/driver/tools
workerTemplate:
spec:
@@ -224,14 +224,14 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/
            name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
        - name: shared-volume
          persistentVolumeClaim:
            claimName: nv-action-vllm-benchmarks-v2
        - name: driver-tools
          hostPath:
            path: /usr/local/Ascend/driver/tools
---
apiVersion: v1


@@ -332,7 +332,6 @@ An L0 `dump.json` contains forward I/O for modules together with parameters. Usi
"data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
}
}
-        },
+        }
}
}
@@ -389,7 +388,6 @@ An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` function
"data_name": "Functional.relu.0.forward.output.0.pt"
}
]
-        },
+        }
}
}
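Both removed commas matter because `dump.json` must be strict JSON; a trailing comma makes the standard parser reject the whole file, as this small check illustrates:

```python
import json

# json.load raises json.JSONDecodeError on a dump that still contains a
# trailing comma, so the files above must be strict JSON.
with open("dump.json") as f:
    dump = json.load(f)
```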


@@ -111,7 +111,7 @@ sudo apt update
sudo apt install libjemalloc2
# Configure jemalloc
-export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
+export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
```
#### 2.2. Tcmalloc


@@ -97,7 +97,8 @@ For local `dataset-path`, please set `hf-name` to its Hugging Face ID like
First start serving your model:
```bash
-VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
+export VLLM_USE_MODELSCOPE=True
+vllm serve Qwen/Qwen3-8B
```
Then run the benchmarking script:
@@ -158,7 +159,7 @@ vllm bench throughput \
If successful, you will see output like the following:
```shell
-Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
+Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280