2 Commits

Author SHA1 Message Date
starkwj
e17006077a fix multiproc executor determine kv cache memory & update Dockerfile 2026-04-24 12:56:40 +00:00
starkwj
e4d898b245 adapt to vllm-ascend v0.18.0rc1
2026-04-21 03:05:32 +00:00
336 changed files with 8682 additions and 50213 deletions

View File

@@ -32,7 +32,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855)
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It is enabled automatically on the decode node in the PD deployment case. Please note that this feature will cost more memory; if you are memory sensitive, please set it to `False` (see the sketch after this list). [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952)
- SSL config can be set in `kv_extra_config` for PD deployment with the mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875)
- support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193)
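
For memory-sensitive deployments, a minimal sketch of opting out of MLAPO before vLLM is imported (illustrative only; the model name is a placeholder):

```python
import os

# Disable MLAPO to trade some decode performance for lower memory usage.
os.environ["VLLM_ASCEND_ENABLE_MLAPO"] = "0"

from vllm import LLM  # noqa: E402

llm = LLM(model="Qwen/Qwen3-32B")  # placeholder model
```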
### Dependencies

View File

@@ -18,8 +18,6 @@
"Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot",
"Eco-Tech/Qwen3-30B-A3B-w8a8",
"Eco-Tech/Kimi-K2.5-W4A8",
"Eco-Tech/Qwen3-VL-235B-A22B-Instruct-w8a8-QuaRot",
"Eco-Tech/Qwen3-VL-32B-Instruct-w8a8-QuaRot",
"Howeee/Qwen2.5-1.5B-apeach",
"IntervitensInc/pangu-pro-moe-model",
"IntervitensInc/pangu-pro-moe-modelt",
@@ -245,9 +243,6 @@
"xlangai/OpenCUA-7B",
"Eco-Tech/GLM-5-w4a8",
"Eco-Tech/GLM-4.7-W8A8-floatmtp",
"MiniMax/MiniMax-M2.5",
"Eco-Tech/Qwen3.5-27B-w8a8-mtp",
"Eco-Tech/MiniMax-M2.5-w8a8-QuaRot",
"Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp"
"MiniMax/MiniMax-M2.5"
]
}

View File

@@ -87,9 +87,6 @@ jobs:
os: linux-aarch64-a2b3-4
tests: tests/e2e/nightly/single_node/ops/multicard_ops_a2/
# YAML-driven tests
- name: qwen3-vl-32b-instruct-w8a8
os: linux-aarch64-a2b3-4
config_file_path: Qwen3-VL-32B-Instruct-W8A8.yaml
- name: qwen3-32b
os: linux-aarch64-a2b3-4
config_file_path: Qwen3-32B.yaml

View File

@@ -119,9 +119,6 @@ jobs:
- name: multi-node-deepseek-v3.2-W8A8-EP
config_file_path: DeepSeek-V3_2-W8A8-EP.yaml
size: 4
- name: multi-node-deepseek-v3.2-W8A8-EP-aime2025
config_file_path: DeepSeek-V3_2-W8A8-EP-aime2025.yaml
size: 4
uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
with:
soc_version: a3
@@ -160,9 +157,6 @@ jobs:
os: linux-aarch64-a3-16
tests: tests/e2e/nightly/single_node/ops/multicard_ops_a3/
# YAML-driven tests
- name: qwen3-vl-235b-a22b-instruct-w8a8
os: linux-aarch64-a3-16
config_file_path: Qwen3-VL-235B-A22B-Instruct-W8A8.yaml
- name: deepseek-r1-0528-w8a8
os: linux-aarch64-a3-16
config_file_path: DeepSeek-R1-0528-W8A8.yaml
@@ -217,21 +211,15 @@ jobs:
- name: qwen2-5-vl-32b
os: linux-aarch64-a3-4
config_file_path: Qwen2.5-VL-32B-Instruct.yaml
- name: qwen3-32b-int8-a3-feature-stack3
os: linux-aarch64-a3-4
config_file_path: Qwen3-32B-Int8-A3-Feature-Stack3.yaml
- name: qwen3-32b-int8-prefix-cache
os: linux-aarch64-a3-4
config_file_path: Prefix-Cache-Qwen3-32B-Int8.yaml
- name: deepseek-r1-0528-w8a8-prefix-cache
os: linux-aarch64-a3-16
config_file_path: Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml
- name: Qwen3.5-27B-w8a8-A3
os: linux-aarch64-a3-2
config_file_path: Qwen3.5-27B-w8a8-A3.yaml
- name: MiniMax-M2.5-w8a8
os: linux-aarch64-a3-16
config_file_path: MiniMax-M2.5-W8A8-A3.yaml
- name: Qwen3.5-397B-A17B-w8a8-mtp
os: linux-aarch64-a3-16
config_file_path: Qwen3.5-397B-A17B-W8A8-mtp-A3.yaml
uses: ./.github/workflows/_e2e_nightly_single_node.yaml
with:
runner: ${{ matrix.test_config.os }}

View File

@@ -50,8 +50,6 @@ All environment variables must be defined in `vllm_ascend/envs.py` using the cen
**Example:**
```python
import os
env_variables = {
"VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
# ...
@@ -101,7 +99,7 @@ pytest -sv tests/ut/ops/test_prepare_finalize.py
pytest -sv tests/ut/ops/test_prepare_finalize.py::test_prepare_inputs
# Run NPU-specific tests (requires NPU hardware)
pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency.py
pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency
```
**Requirement**: Run all tests locally before requesting review. Verify tests pass on NPU hardware for NPU-specific changes.
@@ -165,7 +163,7 @@ pytest -sv tests/e2e/singlecard/test_piecewise_res_consistency.py
**Warning**: `tensor.item()` operations cause synchronization overhead on NPU when the `tensor` is on device.
If the `tensor` is a device tensor, calling `item()` will trigger a synchronous data transfer from NPU to CPU. This can severely degrade performance in hot paths, causing `AsyncScheduler` to block here.
If the `tensor` is a device tensor, the operator `item()` will trigger a synchronous data transfer from NPU to CPU, which can severely degrade performance in hot paths and make `AsyncScheduler` block here.
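A minimal sketch of the anti-pattern and the device-side alternative (hypothetical helper names; standard `torch` semantics assumed):

```python
import torch

def hot_path_bad(lengths_npu: torch.Tensor) -> int:
    # Anti-pattern: .item() on a device tensor forces a synchronous
    # NPU -> CPU copy, stalling the async scheduler.
    return int(lengths_npu.sum().item())

def hot_path_better(lengths_npu: torch.Tensor) -> torch.Tensor:
    # Keep the reduction on device; read it on the host only at an
    # unavoidable sync point outside the hot path.
    return lengths_npu.sum()
```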
**Review Requirements:**

View File

@@ -1 +1 @@
IMPORTANT: Ensure you've thoroughly reviewed the [AGENTS.md](AGENTS.md) file before beginning any work.

View File

@@ -18,7 +18,7 @@
FROM quay.io/ascend/cann:8.5.1-910b-ubuntu22.04-py3.11
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG MOONCAKE_TAG="v0.3.9"
ARG MOONCAKE_TAG="v0.3.8.post1"
ARG SOC_VERSION="ascend910b1"
WORKDIR /workspace

View File

@@ -18,7 +18,7 @@
FROM quay.io/ascend/cann:8.5.1-a3-ubuntu22.04-py3.11
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG MOONCAKE_TAG=v0.3.9
ARG MOONCAKE_TAG=v0.3.8.post1
ARG SOC_VERSION="ascend910_9391"
COPY . /vllm-workspace/vllm-ascend/

View File

@@ -18,7 +18,7 @@
FROM quay.io/ascend/cann:8.5.1-a3-openeuler24.03-py3.11
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG MOONCAKE_TAG="v0.3.9"
ARG MOONCAKE_TAG="v0.3.8.post1"
ARG SOC_VERSION="ascend910_9391"
RUN pip config set global.index-url ${PIP_INDEX_URL}

View File

@@ -18,7 +18,7 @@
FROM quay.io/ascend/cann:8.5.1-910b-openeuler24.03-py3.11
ARG PIP_INDEX_URL="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"
ARG MOONCAKE_TAG="v0.3.9"
ARG MOONCAKE_TAG="v0.3.8.post1"
ARG SOC_VERSION="ascend910b1"
RUN pip config set global.index-url ${PIP_INDEX_URL}

View File

@@ -30,7 +30,7 @@ vLLM Ascend Plugin
- [2025/12] We released the new official version [v0.11.0](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.11.0)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.11.0/) to start using vLLM Ascend Plugin on Ascend.
- [2025/09] We released the new official version [v0.9.1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.1)! Please follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) to start deploying large-scale Expert Parallelism (EP) on Ascend.
- [2025/08] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/7n8OYNrCC_I9SJaybHA_-Q) with vLLM and Tencent! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
- [2025/06] [User stories](https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/index.html) page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] [Contributors](https://docs.vllm.ai/projects/ascend/en/latest/community/contributors.html) page is now live! All contributions deserve to be recorded, thanks for all contributors.
- [2025/05] We've released the first official version [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3)! We collaborated with the vLLM community to publish a blog post sharing our practice: [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
- [2025/03] We hosted the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/VtxO9WXa5fC-mKqlxNUJUQ) with vLLM team! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF).
@@ -103,4 +103,4 @@ Please refer to [Versioning policy](https://docs.vllm.ai/projects/ascend/en/late
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.

View File

@@ -62,7 +62,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained project that lets vLLM run on Ascend NP
## Contributing
Please refer to the [CONTRIBUTING](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html) documentation for more on development environment setup, functional testing, and the PR submission process.
We welcome and value any form of contribution and collaboration.
@@ -74,7 +74,7 @@ The vLLM Ascend plugin (`vllm-ascend`) is a community-maintained project that lets vLLM run on Ascend NP
vllm-ascend has a main branch and development branches.
- **main**: the main branch, which corresponds to vLLM's main branch and is continuously quality-guarded by Ascend CI.
- **releases/vX.Y.Z**: development branches, created as certain new vLLM versions are released; for example, `releases/v0.13.0` is the vllm-ascend development branch for vLLM `v0.13.0`.
The branches under maintenance are listed below:
@@ -97,4 +97,4 @@ vllm-ascend has a main branch and development branches.
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.

View File

@@ -30,7 +30,6 @@ docker build -t $build_image -f ./Dockerfile .
### Environment Variables
- `VNPU_RESERVED_VRAM_SIZE_GB`: The amount of reserved GPU memory for other miscellaneous memory. Only needs to be set for `vllm_vnpu_daemon`. Try increasing the variable if you launch multiple LLM services and encounter OOM. Default: `8`.
- `VLLM_VNPU_SHM_NAME`: The name of the shm file. Needs to be set for all containers of the shared vNPU group. Default: `/vllm_acl_vnpu_offload_shm`.
- `VLLM_VNPU_PRIORITY`: The priority of LLM services. High-priority LLM services are prioritized when processing requests. The value must be an integer in the range [0, 7]. Default: `0`. (A parsing sketch follows this list.)
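
A sketch of how a client process might read these variables, mirroring the documented defaults and the [0, 7] range check (hypothetical helper; the daemon's actual parsing lives in the C++ sources):

```python
import os

def read_vnpu_env() -> tuple[str, int, int]:
    # Defaults mirror the documentation above.
    shm_name = os.getenv("VLLM_VNPU_SHM_NAME", "/vllm_acl_vnpu_offload_shm")
    reserved_gb = int(os.getenv("VNPU_RESERVED_VRAM_SIZE_GB", "8"))
    try:
        priority = int(os.getenv("VLLM_VNPU_PRIORITY", "0"))
    except ValueError:
        priority = 0
    if not 0 <= priority <= 7:
        priority = 0  # out-of-range values fall back to the default
    return shm_name, reserved_gb, priority
```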
## Limitations

View File

@@ -46,7 +46,7 @@ Before running the benchmarks, ensure the following:
```
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading them from the internet, which greatly reduces benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
```json
[
@@ -132,11 +132,12 @@ Once the script completes, you can find the results in the benchmarks/results fo
```shell
.
|-- serving_qwen2_5_7Bvl_tp1_qps_1.json
|-- serving_qwen2_5_7Bvl_tp1_qps_16.json
|-- serving_qwen2_5_7Bvl_tp1_qps_4.json
|-- serving_qwen2_5_7Bvl_tp1_qps_inf.json
|-- throughput_qwen2_5_7Bvl_tp1.json
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
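
For instance, a short script to skim these files (assuming the JSON layout produced by the vLLM benchmark scripts; metric key names may differ between versions):

```python
import json
from pathlib import Path

# Print a one-line summary per result file.
for path in sorted(Path("benchmarks/results").glob("*.json")):
    data = json.loads(path.read_text())
    # "output_throughput" / "requests_per_second" are assumed key names.
    tput = data.get("output_throughput", data.get("requests_per_second"))
    print(f"{path.name}: throughput={tput}")
```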

View File

@@ -595,29 +595,10 @@ static PyObject* python_try_lock_gpu_offload(PyObject* self, PyObject* args) {
}
static PyObject* python_unlock_gpu_offload(PyObject* self, PyObject* args) {
int keep_wait = 0;
if (!PyArg_ParseTuple(args, "|p", &keep_wait)) {
return NULL;
}
shm_worker->unlock_gpu(keep_wait != 0);
shm_worker->unlock_gpu();
Py_RETURN_NONE;
}
static PyObject* python_start_wait_offload(PyObject* self, PyObject* args) {
shm_worker->start_wait();
Py_RETURN_NONE;
}
static PyObject* python_cancel_wait_offload(PyObject* self, PyObject* args) {
shm_worker->cancel_wait();
Py_RETURN_NONE;
}
static PyObject* python_has_higher_priority_waiter_offload(PyObject* self, PyObject* args) {
bool has_higher = shm_worker->has_higher_priority_waiter();
return PyBool_FromLong(has_higher);
}
static PyMethodDef module_methods[] = {
{"init_module", (PyCFunction)py_init_module, METH_VARARGS,
"Initialize module with python_malloc and python_free callables."},
@@ -638,13 +619,7 @@ static PyMethodDef module_methods[] = {
{"python_try_lock_gpu_offload", (PyCFunction)python_try_lock_gpu_offload,
METH_NOARGS, "Lock GPU."},
{"python_unlock_gpu_offload", (PyCFunction)python_unlock_gpu_offload,
METH_VARARGS, "Unlock GPU."},
{"python_start_wait_offload", (PyCFunction)python_start_wait_offload,
METH_NOARGS, "Start waiting for GPU lock."},
{"python_cancel_wait_offload", (PyCFunction)python_cancel_wait_offload,
METH_NOARGS, "Cancel waiting for GPU lock."},
{"python_has_higher_priority_waiter_offload", (PyCFunction)python_has_higher_priority_waiter_offload,
METH_NOARGS, "Check if there is a higher priority waiter."},
METH_NOARGS, "Unlock GPU."},
{NULL, NULL, 0, NULL} // sentinel
};

View File

@@ -224,7 +224,7 @@ __aicore__ inline void DispatchFFNCombine<TemplateMMA2ACFunc>::Process()
constexpr uint32_t ubStages = 2;
using EpilogueDispatchPolicy1 = Epilogue::EpilogueAtlasA2PerTokenDequantSwigluQuant<ubStages>;
using ScaleType = Gemm::GemmType<uint64_t, layout::VectorLayout>;
using PerTokenScaleType = Gemm::GemmType<float, layout::VectorLayout>;
using ElementMulType = Gemm::GemmType<float, layout::RowMajor>;
@@ -234,8 +234,7 @@ __aicore__ inline void DispatchFFNCombine<TemplateMMA2ACFunc>::Process()
using BlockEpilogue1 = Epilogue::Block::BlockEpilogue<EpilogueDispatchPolicy1, CType, PerTokenScaleType,
D1Type, TileElemWiseMuls, TileCopy1>;
using EpilogueDispatchPolicy2 = Epilogue::EpilogueAtlasA2PerTokenDequant<ubStages>;
using EpilogueDispatchPolicy2 = Epilogue::EpilogueAtlasA2PerTokenDequantV2<ubStages>;
using TileCopy2 = Epilogue::Tile::TileCopy<ArchTag, CType, ScaleType, PerTokenScaleType, D2Type>;
using BlockEpilogue2 = Epilogue::Block::BlockEpilogue<EpilogueDispatchPolicy2, CType,PerTokenScaleType,
D2Type, TileCopy2>;
@@ -255,11 +254,9 @@ __aicore__ inline void DispatchFFNCombine<TemplateMMA2ACFunc>::Process()
GemmCoord problemShape{static_cast<uint32_t>(m), static_cast<uint32_t>(n), static_cast<uint32_t>(k)};
uint32_t epilogueCoreNum = aivNum;
uint32_t epilogueGranularity = expertPerRank - 3;
if (expertPerRank <= 4) {
epilogueGranularity = expertPerRank - 1;
}
uint32_t epilogueCoreNum = aivNum / 2;
uint32_t epilogueGranularity = expertPerRank - 1;
typename MatmulKernel::Params params{
problemShape, static_cast<uint32_t>(EP), static_cast<uint32_t>(listLen), static_cast<uint32_t>(expertPerRank), static_cast<uint32_t>(maxOutputSize),
static_cast<uint32_t>(rank), static_cast<uint32_t>(rankSize),
@@ -280,4 +277,4 @@ __aicore__ inline void DispatchFFNCombine<TemplateMMA2ACFunc>::Process()
}
} // DispatchFFNCombineImpl
#endif // DISPATCH_FFN_COMBINE_H

View File

@@ -571,7 +571,6 @@ private:
if constexpr (BlockMmad::DispatchPolicy::ASYNC) {
blockMmad.SynchronizeBlock();
}
blockMmad.Finalize(params.expertPerRank - 1, 0);
}
@@ -728,6 +727,19 @@ private:
}
CATLASS_DEVICE
void CombineSetFlag() {
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID1);
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID2);
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID3);
AscendC::SetFlag<AscendC::HardEvent::S_MTE2>(EVENT_ID2);
AscendC::SetFlag<AscendC::HardEvent::S_MTE2>(EVENT_ID3);
AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID1);
}
CATLASS_DEVICE
void DispatchAndCombine(Params const &params) {
icache_preload(8);
@@ -788,17 +800,13 @@ private:
GM_ADDR otherRankPtr = shmem(0, dstEpIdx);
AscendC::GlobalTensor<ElementA> gmRemoteA;
gmRemoteA.SetGlobalBuffer(reinterpret_cast<__gm__ ElementA*>(otherRankPtr + peermemInfo.offsetA));
AscendC::GlobalTensor<ElementPerTokenScale> gmRemotePerTokenScale;
gmRemotePerTokenScale.SetGlobalBuffer(reinterpret_cast<__gm__ ElementPerTokenScale*>(otherRankPtr + peermemInfo.offsetPeerPerTokenScale));
MatrixCoord offsetA{rowStart, 0};
MatrixCoord offsetPeer{rowSrc, 0};
int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
int64_t gmOffsetPeer = params.layoutA.GetOffset(offsetPeer);
int64_t gmOffsetPeer = rowSrc * (params.problemShape.k() + ALIGN_512);
// Communication data
CopyGMToGM(gmA[gmOffsetA], gmRemoteA[gmOffsetPeer], rows * params.problemShape.k(), params.ubMoveNum);
// Communication scale
CopyGMToGM(gmPerTokenScale1[rowStart], gmRemotePerTokenScale[rowSrc], rows, rows);
CopyGMToGMPerToken(gmA[gmOffsetA], gmPerTokenScale1[rowStart], gmRemoteA[gmOffsetPeer], rows, params.problemShape.k());
}
}
@@ -829,12 +837,16 @@ private:
uint32_t n2 = params.problemShape.k();
typename BlockEpilogue2::Params epilogueParams{
static_cast<int32_t>(params.EP),
static_cast<int32_t>(params.expertPerRank),
static_cast<int32_t>(params.rank),
reinterpret_cast<__gm__ int32_t *>(shmem() + peermemInfo.offsetPeerTokenPerExpert),
static_cast<int32_t>(n2)
params.layoutD2,
static_cast<int32_t>(n2),
static_cast<int32_t>(L1TileShape::N),
shmem,
static_cast<int32_t>(peermemInfo.offsetD)
};
uint32_t n = params.problemShape.n();
@@ -878,65 +890,109 @@ private:
blockEpilogue1.Finalize();
blockEpilogue2.SetFlag();
CombineV1(params, blockEpilogue2);
CombineSetFlag();
CombineV2(params, blockEpilogue2);
AscendC::SyncAll<true>();
#ifndef __CROSSRANKSYNCANDALLGATHERV1__
ResetTokenPerExpert(params.EP * AlignUp(params.EP * params.expertPerRank, 128));
#endif
shmem.CrossRankSync();
MoeTokenUnpermuteTilingData tilingData;
MoeTokenUnpermuteTiling(params.problemShape.m() * params.topK, n2, params.topK, tilingData, coreNum);
KernelMoeTokenUnpermute<ElementD2, int32_t, float, true> kernelMoeTokenUnpermuteOp;
kernelMoeTokenUnpermuteOp.Init(shmem() + peermemInfo.offsetD, workspaceInfo.expandedRowIdx, params.probs, reinterpret_cast<GM_ADDR>(params.ptrOutput), &tilingData);
kernelMoeTokenUnpermuteOp.Process();
shmem.InitStatusTargetSum();
if (get_subblockid() == 0) {
AscendC::LocalTensor<int32_t> ctrBuffer = resource.ubBuf.template GetBufferByByte<int32_t>(0);
shmem.CrossRankSyncV2Set(ctrBuffer);
} else {
uint32_t uboffset = 0;
uint32_t aicCoreNum = coreNum / 2;
uint32_t aicCoreIdx = get_block_idx();
uint32_t sendRankNum_ = params.EP / aicCoreNum;
uint32_t remainderRankNum = params.EP % aicCoreNum;
if (aicCoreIdx < remainderRankNum) {
sendRankNum_++;
}
AscendC::LocalTensor<float> statusTensor = resource.ubBuf.template GetBufferByByte<float>(uboffset);
uboffset += sendRankNum_ * UB_ALIGN;
AscendC::LocalTensor<float> gatherMaskOutTensor = resource.ubBuf.template GetBufferByByte<float>(uboffset);
uboffset += AlignUp(params.EP * sizeof(float), 32);
AscendC::LocalTensor<uint32_t> gatherTmpTensor = resource.ubBuf.template GetBufferByByte<uint32_t>(uboffset);
uboffset += AlignUp(sizeof(uint32_t), 32);
AscendC::LocalTensor<float> statusSumOutTensor = resource.ubBuf.template GetBufferByByte<float>(uboffset);
uboffset += AlignUp(sizeof(float), 32);
shmem.CrossRankSyncV2Wait(statusTensor, gatherMaskOutTensor, gatherTmpTensor, statusSumOutTensor);
MoeTokenUnpermuteTilingData tilingData;
MoeTokenUnpermuteTiling(params.problemShape.m() * params.topK, n2, params.topK, tilingData, coreNum / 2);
KernelMoeTokenUnpermute<ElementD2, int32_t, float, true> kernelMoeTokenUnpermuteOp;
kernelMoeTokenUnpermuteOp.Init(shmem() + peermemInfo.offsetD, workspaceInfo.expandedRowIdx, params.probs, reinterpret_cast<GM_ADDR>(params.ptrOutput), &tilingData);
kernelMoeTokenUnpermuteOp.Process();
}
}
CATLASS_DEVICE
void CombineV1(Params const &params, BlockEpilogue2 & blockEpilogue) {
void CombineV2(Params const &params, BlockEpilogue2 & blockEpilogue) {
BlockScheduler blockScheduler;
int32_t syncLoopIdx = 0;
uint32_t startCoreIdx = 0;
uint32_t aicCoreNum = coreNum / 2;
uint32_t aicCoreIdx = get_block_idx();
uint32_t aivSubCoreIdx = get_subblockid();
uint32_t preSrcExpertSum = 0;
uint32_t n2 = params.problemShape.k();
int32_t prevGroupSum2 = 0;
uint32_t k2 = params.problemShape.n() / 2;
icache_preload(8);
for (uint32_t t_groupIdx = 0; t_groupIdx < params.expertPerRank; ++t_groupIdx) {
int32_t flagId = t_groupIdx / CROSS_CORE_FLAG_MAX_SET_COUNT;
AscendC::CrossCoreWaitFlag<0x2>(flagId);
AscendC::SyncAll<true>();
for (uint32_t groupIdx = 0; groupIdx < params.expertPerRank; ++groupIdx) {
uint32_t currentExpertM = cumsumMM((params.EP - 1) * params.expertPerRank + groupIdx);
if (preSrcExpertSum >= params.maxOutputSize) {
currentExpertM = 0;
} else if (preSrcExpertSum + currentExpertM > params.maxOutputSize) {
currentExpertM = params.maxOutputSize - preSrcExpertSum;
}
GemmCoord inGroupProblemShape{currentExpertM, n2, k2}; // M N K
blockScheduler.Update(inGroupProblemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
uint32_t coreLoops = blockScheduler.GetCoreLoops();
uint32_t startLoopIdx = ((aicCoreIdx < startCoreIdx) ? (aicCoreIdx + aicCoreNum) : aicCoreIdx) - startCoreIdx;
uint32_t groupIdx = t_groupIdx;
for (uint32_t loopIdx = startLoopIdx; loopIdx < coreLoops; loopIdx += aicCoreNum) {
GemmCoord blockCoord = blockScheduler.GetBlockCoord(loopIdx);
GemmCoord actualBlockShape = blockScheduler.GetActualBlockShape(blockCoord);
int32_t m0 = 16;
// Block count, the shape of each block is (m0, actualBlockShape.n())
int32_t m_rows = (actualBlockShape.m() + m0 - 1) / m0;
int32_t aiv_m_rows = m_rows / 2;
if (aivSubCoreIdx == 1 && aiv_m_rows * 2 < m_rows) {
aiv_m_rows += 1;
}
uint32_t m_offset = blockCoord.m() * L1TileShape::M;//blockOffset
if(aivSubCoreIdx == 1) {
m_offset += (m_rows / 2) * m0;
}
for(int32_t dstEpIdx = coreIdx; dstEpIdx < params.EP; dstEpIdx += coreNum) {
__gm__ void* dstPeermemPtr = shmem(peermemInfo.offsetD, dstEpIdx);
AscendC::GlobalTensor<ElementD2> gmRemotePeer;
gmRemotePeer.SetGlobalBuffer(reinterpret_cast<__gm__ ElementD2*>(dstPeermemPtr));
uint32_t srcRowOffset = (dstEpIdx == 0 ? 0 : cumsumMM((dstEpIdx - 1) * params.expertPerRank + groupIdx)) + prevGroupSum2;
if (srcRowOffset < params.maxOutputSize) {
uint32_t dataRows = tokenPerExpert(tokenPerExpertLayout(dstEpIdx, params.rank, groupIdx));
if (srcRowOffset + dataRows > params.maxOutputSize) {
dataRows = params.maxOutputSize - srcRowOffset;
}
//uint32_t dstRowOffset = preSumBeforeRank(2 * dstEpIdx * FLAGSTRIDE + groupIdx);
int32_t tmpBlock = AlignUp(params.expertPerRank, FLAGSTRIDE);
//uint32_t dstRowOffset = preSumBeforeRank(dstEpIdx * tmpBlock + groupIdx);
uint32_t dstRowOffset = preSumBeforeRank(dstEpIdx * params.expertPerRank + groupIdx);
MatrixCoord offsetC{srcRowOffset, 0};
MatrixCoord offsetPeer{dstRowOffset, 0};
MatrixCoord shapeC{dataRows, n2};
int64_t gmOffsetC = params.layoutD2.GetOffset(offsetC);
int64_t gmOffsetPeer = params.layoutD2.GetOffset(offsetPeer);
if constexpr (std::is_same_v<ElementA, int8_t>) {
blockEpilogue(gmC2[gmOffsetC], shapeC, gmPerTokenScale2[srcRowOffset], gmRemotePeer[gmOffsetPeer]);
} else {
blockEpilogue(gmC2[gmOffsetC], shapeC, gmRemotePeer[gmOffsetPeer]);
for (;syncLoopIdx <= groupIdx; syncLoopIdx ++) {
int32_t flag_id = syncLoopIdx / CROSS_CORE_FLAG_MAX_SET_COUNT;
AscendC::CrossCoreWaitFlag<0x2>(flag_id);
}
for (int32_t cur_row = 0; cur_row < aiv_m_rows; cur_row ++) {
GemmCoord realTileCoord{m_offset, blockCoord.n() * L1TileShape::N, 1};
uint32_t actualm = m0;
if(aivSubCoreIdx == 1 && cur_row == aiv_m_rows - 1){
actualm = actualBlockShape.m() - (m_rows / 2) * m0 - cur_row * m0;
}
GemmCoord realTileShape{actualm, actualBlockShape.n(), 1};
blockEpilogue(gmC2, gmPerTokenScale2, realTileCoord, realTileShape, groupIdx, preSrcExpertSum, preSumBeforeRank);
m_offset += m0;
}
}
prevGroupSum2 += cumsumMM((params.EP - 1) * params.expertPerRank + groupIdx);
preSrcExpertSum += currentExpertM;
startCoreIdx = (startCoreIdx + coreLoops) % aicCoreNum;
}
blockEpilogue.Finalize();
}
private:
struct WorkspaceInfo {
GM_ADDR ptrA;
@@ -1040,4 +1096,4 @@ private:
} // namespace Catlass::Gemm::Kernel
#endif // DISPATCH_FFN_COMBINE_KERNEL_HPP

View File

@@ -35,7 +35,6 @@ class MoeV2FullLoadDynamicQuant : public MoeV2SortBase {
__aicore__ inline void CopyOutIdx();
__aicore__ inline void CopyOutEmpty();
__aicore__ inline void CopyOutXQuant1H();
__aicore__ inline void CopyOutXQuantEH();
__aicore__ inline void ComputeExpertTokenCountOrCumsum();
__aicore__ inline void Compute(LocalTensor<float>& smoothLocal);
@@ -49,6 +48,7 @@ class MoeV2FullLoadDynamicQuant : public MoeV2SortBase {
int64_t k_;
int64_t n_;
int64_t cols_;
int64_t cols_scale_;
int64_t activateRows_;
int64_t expertNum;
int64_t expertCapacity;
@@ -63,12 +63,10 @@ class MoeV2FullLoadDynamicQuant : public MoeV2SortBase {
TQue<QuePosition::VECIN, 1> smoothInQueue;
TQue<QuePosition::VECOUT, 1> calcQueue;
TQue<QuePosition::VECOUT, 1> inputXOutQueue;
TQue<QuePosition::VECOUT, 1> scaleOutQueue;
GlobalTensor<T> xGm_;
GlobalTensor<int32_t> expertIdxGm_;
GlobalTensor<float> quantSmoothGm;
GlobalTensor<float> dynamicQuantScaleGm;
GlobalTensor<int8_t> expandedXGm_;
GlobalTensor<int32_t> expandedRowIdxGm_;
@@ -225,7 +223,7 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Compute(LocalTensor<float>&
LocalTensor<float> tempLocal = calcQueue.AllocTensor<float>();
LocalTensor<int8_t> outLocal = inputXOutQueue.AllocTensor<int8_t>();
LocalTensor<float> dynamicQuantLocal = scaleOutQueue.AllocTensor<float>();
LocalTensor<float> dynamicQuantLocal = outLocal[this->cols_].template ReinterpretCast<float>();
if constexpr (!IsSameType<T, float>::value) {
Cast(inLocal, inLocal.ReinterpretCast<T>()[colsAlign], RoundMode::CAST_NONE, this->cols_);
@@ -259,7 +257,6 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Compute(LocalTensor<float>&
calcQueue.FreeTensor(tempLocal);
inputXOutQueue.EnQue(outLocal);
scaleOutQueue.EnQue(dynamicQuantLocal);
}
template <typename T>
@@ -275,7 +272,7 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::CopyOutXQuant1H() {
DataCopyExtParams dataXCopyParams{1, static_cast<uint32_t>(this->cols_ * sizeof(T)), 0, 0, 0};
DataCopyExtParams smoothCopyParams{1, static_cast<uint32_t>(this->cols_ * sizeof(float)), 0, 0, 0};
DataCopyExtParams intriParams{1, static_cast<uint32_t>(this->cols_ * sizeof(int8_t)), 0, 0, 0};
DataCopyExtParams intriParams{1, static_cast<uint32_t>((this->cols_ + BLOCK_BYTES) * sizeof(int8_t)), 0, 0, 0};
LocalTensor<float> smoothLocal;
if (smoothType == 1) {
@@ -295,7 +292,6 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::CopyOutXQuant1H() {
xCopyInQueue_.EnQue<T>(xLocal);
Compute(smoothLocal);
LocalTensor<float> quantScaleLocal = scaleOutQueue.DeQue<float>();
LocalTensor<int8_t> outLocal = inputXOutQueue.DeQue<int8_t>();
while (curRowsStart <= curRowsEnd && curRowsStart / this->k_ == row) {
int32_t outIndex = expandedRowIdx.GetValue(curRowsStart);
@@ -303,74 +299,15 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::CopyOutXQuant1H() {
if (outIndex == -1 || (this->dropPadMode == DROPLESS_MODE && outIndex >= this->activateRows_)) {
continue;
}
DataCopyPad(expandedXGm_[outIndex * cols_], outLocal, intriParams);
DataCopyPad(dynamicQuantScaleGm[outIndex], quantScaleLocal, {1, 4, 0, 0, 0});
DataCopyPad(expandedXGm_[outIndex * this->cols_scale_], outLocal, intriParams);
}
xCopyInQueue_.FreeTensor(xLocal);
inputXOutQueue.FreeTensor(outLocal);
scaleOutQueue.FreeTensor(quantScaleLocal);
}
if (smoothType == 1) {
smoothInQueue.FreeTensor(smoothLocal);
}
expandedRowIdxCopyOutQueue_.FreeTensor(expandedRowIdx);
}
template <typename T>
__aicore__ inline void MoeV2FullLoadDynamicQuant<T>::CopyOutXQuantEH() {
LocalTensor<int32_t> expandedRowIdx = expandedRowIdxCopyOutQueue_.DeQue<int32_t>();
expandedRowIdxCopyOutQueue_.FreeTensor(expandedRowIdx);
Muls(expandDstToSrcRowLocal.ReinterpretCast<float>(), expandDstToSrcRowLocal.ReinterpretCast<float>(), (float)-1,
this->totalLength);
pipe_barrier(PIPE_V);
LocalTensor<int32_t> sortedRowIdx = expandDstToSrcRowLocal.ReinterpretCast<int32_t>();
Cast(sortedRowIdx, expandDstToSrcRowLocal.ReinterpretCast<float>(), RoundMode::CAST_ROUND, this->totalLength);
int64_t curRowsStart = this->blockIdx_ * this->perCoreRows_;
int64_t curRowsEnd = curRowsStart + this->coreRows_ - 1;
DataCopyExtParams dataXCopyParams{1, static_cast<uint32_t>(this->cols_ * sizeof(T)), 0, 0, 0};
DataCopyExtParams smoothCopyParams{1, static_cast<uint32_t>(this->cols_ * sizeof(float)), 0, 0, 0};
DataCopyExtParams intriParams{1, static_cast<uint32_t>(this->cols_ * sizeof(int8_t)), 0, 0, 0};
for (int64_t row = curRowsStart; row <= curRowsEnd; row++) {
if (this->dropPadMode == DROPLESS_MODE && row >= this->activateRows_) {
break;
}
int32_t srcIdx = sortedRowIdx.GetValue(row);
int32_t expertIdx = expandedExpertIdxLocal.GetValue(row);
LocalTensor<T> inLocal = xCopyInQueue_.AllocTensor<T>();
LocalTensor<float> smoothLocal = smoothInQueue.AllocTensor<float>();
if constexpr (IsSameType<T, float>::value) {
DataCopyPad(inLocal, xGm_[srcIdx / this->k_ * this->cols_], dataXCopyParams, {false, 0, 0, 0});
} else {
DataCopyPad(inLocal[colsAlign], xGm_[srcIdx / this->k_ * this->cols_], dataXCopyParams, {false, 0, 0, 0});
}
DataCopyPad(smoothLocal, quantSmoothGm[expertIdx * this->cols_], smoothCopyParams, {false, 0, 0, 0});
xCopyInQueue_.EnQue<T>(inLocal);
smoothInQueue.EnQue(smoothLocal);
smoothLocal = smoothInQueue.DeQue<float>();
Compute(smoothLocal);
LocalTensor<float> quantScaleLocal = scaleOutQueue.DeQue<float>();
DataCopyPad(dynamicQuantScaleGm[row], quantScaleLocal, {1, 4, 0, 0, 0});
LocalTensor<int8_t> outLocal = inputXOutQueue.DeQue<int8_t>();
DataCopyPad(expandedXGm_[row * this->cols_], outLocal, intriParams);
xCopyInQueue_.FreeTensor(inLocal);
smoothInQueue.FreeTensor(smoothLocal);
inputXOutQueue.FreeTensor(outLocal);
scaleOutQueue.FreeTensor(quantScaleLocal);
}
expandDstToSrcRowQueue_.FreeTensor(expandDstToSrcRowLocal);
expandedExpertIdxCopyOutQueue_.FreeTensor(expandedExpertIdxLocal);
}
template <typename T>
__aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Init(GM_ADDR x, GM_ADDR expertIdx, GM_ADDR expandedX,
GM_ADDR expandedRowIdx, GM_ADDR expertTokensCountOrCumsum,
@@ -384,6 +321,7 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Init(GM_ADDR x, GM_ADDR exp
this->k_ = tilingData->k;
this->n_ = tilingData->n;
this->cols_ = tilingData->cols;
this->cols_scale_ = this->cols_ + ALIGN_512;
this->needCoreNum_ = this->gatherOutTilingData_->needCoreNum;
this->perCoreRows_ = this->gatherOutTilingData_->perCoreRows;
this->activateRows_ = this->gatherOutTilingData_->activateRows;
@@ -414,7 +352,6 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Init(GM_ADDR x, GM_ADDR exp
Align(this->expertNum, sizeof(int32_t)));
}
quantSmoothGm.SetGlobalBuffer((__gm__ float*)quantSmooth);
dynamicQuantScaleGm.SetGlobalBuffer((__gm__ float*)dynamicQuantScale);
int64_t kvFactor = 2;
int64_t buffSize = this->sortNum_ * sizeof(int32_t);
@@ -438,8 +375,7 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Init(GM_ADDR x, GM_ADDR exp
}
pipe->InitBuffer(smoothInQueue, 1, AlignBytes(this->cols_, sizeof(float)));
pipe->InitBuffer(calcQueue, 1, AlignBytes(this->cols_, sizeof(float)));
pipe->InitBuffer(inputXOutQueue, 1, AlignBytes(this->cols_, sizeof(int8_t)));
pipe->InitBuffer(scaleOutQueue, 1, BLOCK_BYTES + BLOCK_BYTES);
pipe->InitBuffer(inputXOutQueue, 1, AlignBytes(this->cols_scale_, sizeof(int8_t)));
}
template <typename T>
@@ -455,11 +391,7 @@ __aicore__ inline void MoeV2FullLoadDynamicQuant<T>::Process() {
} else {
CopyOutEmpty();
}
if (smoothType == 2) {
CopyOutXQuantEH();
} else {
CopyOutXQuant1H();
}
CopyOutXQuant1H();
}
}
} // namespace MoeInitRoutingQuantV2

View File

@@ -66,6 +66,7 @@ class MoeV2GatherDynamicQuant {
int64_t needCoreNum;
int64_t blockIdx;
int64_t cols;
int64_t cols_scale_;
int64_t n;
int64_t k;
int64_t totalLength;
@@ -117,7 +118,7 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::Compute(LocalTensor<float>& s
LocalTensor<float> tempLocal = calcQueue.AllocTensor<float>();
LocalTensor<int8_t> outLocal = inputXOutQueue.AllocTensor<int8_t>();
LocalTensor<float> dynamicQuantLocal = scaleOutQueue.AllocTensor<float>();
LocalTensor<float> dynamicQuantLocal = outLocal[this->cols].template ReinterpretCast<float>();
if constexpr (!IsSameType<T, float>::value) {
Cast(inLocal, inLocal.ReinterpretCast<T>()[perLoopColsAlign], RoundMode::CAST_NONE, this->cols);
@@ -151,7 +152,6 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::Compute(LocalTensor<float>& s
calcQueue.FreeTensor(tempLocal);
inputXOutQueue.EnQue(outLocal);
scaleOutQueue.EnQue(dynamicQuantLocal);
}
template <typename T>
@@ -163,7 +163,7 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::CopyOutXQuant1H(int64_t progr
int64_t currentLoopStartRow = initialRow / this->k;
int64_t currentLoopLastRow = (initialRow + this->currentLoopRows - 1) / this->k;
DataCopyExtParams copyInParams{1, static_cast<uint32_t>(this->cols * sizeof(T)), 0, 0, 0};
DataCopyExtParams copyOutParams{1, static_cast<uint32_t>(this->cols * sizeof(int8_t)), 0, 0, 0};
DataCopyExtParams copyOutParams{1, static_cast<uint32_t>((this->cols + BLOCK_BYTES) * sizeof(int8_t)), 0, 0, 0};
DataCopyExtParams smoothParams{1, static_cast<uint32_t>(this->cols * sizeof(float)), 0, 0, 0};
LocalTensor<float> smoothLocal;
@@ -187,7 +187,6 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::CopyOutXQuant1H(int64_t progr
// Compute quantization
Compute(smoothLocal);
LocalTensor<float> quantScaleLocal = scaleOutQueue.DeQue<float>();
LocalTensor<int8_t> outLocal = inputXOutQueue.DeQue<int8_t>();
while (curLoopRow < this->currentLoopRows && initialRow / this->k == row) {
@@ -197,15 +196,11 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::CopyOutXQuant1H(int64_t progr
if (outIndex == -1 || (this->dropPadMode == DROPLESS_MODE && outIndex >= this->activateRows)) {
continue;
}
DataCopyPad(expandedXGm[outIndex * cols], outLocal, copyOutParams);
DataCopyPad(dynamicQuantScaleGm[outIndex], quantScaleLocal, {1, 4, 0, 0, 0});
// Scale is placed after the data position
DataCopyPad(expandedXGm[outIndex * cols_scale_], outLocal, copyOutParams);
}
inputXInQueue.FreeTensor(inLocal);
inputXOutQueue.FreeTensor(outLocal);
scaleOutQueue.FreeTensor(quantScaleLocal);
}
if (smoothType == 1) {
smoothInQueue.FreeTensor(smoothLocal);
}
expandRowIdxInQueue.FreeTensor(indicesLocal);
}
@@ -463,6 +458,7 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::Init(GM_ADDR inputX, GM_ADDR
this->needCoreNum = this->gatherOutTilingData->needCoreNum;
this->activateRows = this->gatherOutTilingData->activateRows;
this->cols = tilingData->cols;
this->cols_scale_ = this->cols + ALIGN_512;
this->n = tilingData->n;
this->k = tilingData->k;
this->totalLength = tilingData->n * tilingData->k;
@@ -518,33 +514,15 @@ __aicore__ inline void MoeV2GatherDynamicQuant<T>::Init(GM_ADDR inputX, GM_ADDR
pipe->InitBuffer(smoothInQueue, BUFFER_NUM, AlignBytes(this->perLoopCols, sizeof(float)));
pipe->InitBuffer(calcQueue, 1, AlignBytes(this->perLoopCols, sizeof(float)));
pipe->InitBuffer(inputXOutQueue, 1, AlignBytes(this->perLoopCols, sizeof(int8_t)));
pipe->InitBuffer(scaleOutQueue, 1, BLOCK_BYTES + BLOCK_BYTES);
}
template <typename T>
__aicore__ inline void MoeV2GatherDynamicQuant<T>::Process() {
if (this->blockIdx < this->needCoreNum) {
currentLoopRows = perLoopRows;
if (colLoops > 1) { // A single row cannot be fully loaded; workspace is required
if (smoothType == 2) {
for (int64_t loop = 0; loop < this->rowLoops - 1; loop++) {
CopyInExpandedExpertIdx(loop);
CopyOutPartialXQuantEH(loop);
}
currentLoopRows = lastLoopRows;
CopyInExpandedExpertIdx(this->rowLoops - 1);
CopyOutPartialXQuantEH(this->rowLoops - 1);
} else {
for (int64_t loop = 0; loop < this->rowLoops - 1; loop++) {
CopyInExpandedRowIdx(loop);
CopyOutPartialXQuant1H(loop);
}
currentLoopRows = lastLoopRows;
CopyInExpandedRowIdx(this->rowLoops - 1);
CopyOutPartialXQuant1H(this->rowLoops - 1);
}
} else { // A single row can be fully loaded
if (colLoops > 1) { // Cannot fit all data in one row, workspace is required
trap(); // Not supported
} else { // All data can fit in one row
if (smoothType == 2) {
for (int64_t loop = 0; loop < this->rowLoops - 1; loop++) {
CopyInExpandedExpertIdx(loop);

View File

@@ -85,8 +85,9 @@ KernelMoeTokenUnpermute<T1, T2, T3, PROBS>::Init(GM_ADDR permuted_tokens, GM_ADD
GM_ADDR unpermuted_tokens,
const MoeTokenUnpermuteTilingData *__restrict tiling_data)
{
this->blockIdx = get_block_idx() + get_subblockid() * get_block_num();
this->blockNum = get_block_num() * get_subblockdim();
this->blockIdx = get_block_idx();
this->blockNum = get_block_num();
if (blockIdx >= blockNum) {
return;
}

View File

@@ -99,20 +99,12 @@ public:
eventUbDMTE3VList[i] = eventMTE3V++;
eventUbDVMTE3List[i] = eventVMTE3++;
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(eventUbCVMTE2List[i]);
AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventUbDMTE3VList[i]);
ubCFp32List[i] = resource.ubBuf.template GetBufferByByte<float>(ubOffset);
ubOffset += blockN * sizeof(float);
}
}
CATLASS_DEVICE
void SetFlag()
{
for (uint32_t i = 0; i < UB_STAGES; ++i) {
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(eventUbCVMTE2List[i]);
AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventUbDMTE3VList[i]);
}
}
CATLASS_DEVICE
void Finalize()
{

View File

@@ -16,8 +16,8 @@
#include "spdlog/spdlog.h"
#define MAX_WORKERS 64
#define MAX_DEVICES 32
#define MAX_WORKERS 60
#define MAX_DEVICES 16
static inline std::string get_shm_name() {
const char *env_shm_name = getenv("VLLM_VNPU_SHM_NAME");
@@ -34,7 +34,7 @@ static inline std::string get_shm_name() {
}
static constexpr uint32_t heartbeat_us = 1000; // microseconds
static constexpr uint32_t heartbeat_check_everyN = 100;
static constexpr uint32_t heartbeat_check_everyN = 50;
static constexpr uint32_t heartbeat_timeout_us =
heartbeat_check_everyN * heartbeat_us;
@@ -52,8 +52,6 @@ static inline uint64_t heartbeat_ts_us() {
.count());
}
// GPU flag layout (64 bits):
// [lock (1 bit) | reserved (31 bits) | tgid (32 bits)]
static inline uint32_t unpack_lock_field(uint64_t gpu_flag) {
return static_cast<uint32_t>(gpu_flag >> 32);
}
@@ -70,43 +68,16 @@ static inline uint64_t pack_unlocked_tgid(int32_t tgid) {
return static_cast<uint64_t>(tgid);
}
// waiting_worker_flag layout (64 bits):
// [ device_id (5 bits) | priority (3 bits) | timestamp (24 bits) | tgid (32 bits)]
static inline uint32_t unpack_waiting_device_id(uint64_t flag) {
return static_cast<uint32_t>(flag >> 59);
}
static inline uint16_t unpack_waiting_priority(uint64_t flag) {
return static_cast<uint16_t>((flag >> 56) & 0x7);
}
static inline uint32_t unpack_waiting_timestamp_ms(uint64_t flag) {
return static_cast<uint32_t>((flag >> 32) & 0xFFFFFF);
}
static inline int32_t unpack_waiting_tgid(uint64_t flag) {
return static_cast<int32_t>(flag & 0xFFFFFFFF);
}
static inline uint64_t pack_waiting_flag(uint32_t device_id, uint16_t priority,
uint32_t timestamp, int32_t tgid) {
return (static_cast<uint64_t>(device_id & 0x1F) << 59) |
(static_cast<uint64_t>(priority & 0x7) << 56) |
(static_cast<uint64_t>(timestamp & 0xFFFFFF) << 32) |
(static_cast<uint64_t>(tgid) & 0xFFFFFFFF);
}
// mmap usually page-aligned
struct alignas(64) ShmHelper {
struct VramInfo {
uint64_t total_vmem_size;
uint64_t shareable_handle;
};
VramInfo vram_info[MAX_DEVICES]; // support max 32 devices
VramInfo vram_info[MAX_DEVICES]; // support max 16 NPUs
// GPU lock flag
std::atomic<uint64_t> gpu_flag[MAX_DEVICES];
std::atomic<uint64_t> waiting_worker_flags[MAX_WORKERS];
// uint8_t _padding1[64 - sizeof(std::atomic<uint64_t>)];
// request
enum RequestType: uint32_t {

View File

@@ -55,7 +55,7 @@ void ShmManager::run_busy_loop() {
while (!stop_loop_flag.load(std::memory_order_acquire)) {
process_requests();
if (loop_cnt % heartbeat_check_everyN == 0) {
if (loop_cnt % heartbeat_check_everyN== 0) {
check_heart_beats();
}
loop_cnt = (loop_cnt + 1) % heartbeat_check_everyN;
@@ -152,8 +152,6 @@ void ShmManager::check_heart_beats() {
shm_helper->heart_beats[i].tgid = 0;
shm_helper->heart_beats[i].timestamp.store(0,
std::memory_order_release);
// clear waiting flag
shm_helper->waiting_worker_flags[i].store(0, std::memory_order_release);
// check dead lock
for (int gpu_id : valid_gpu_ids) {
uint64_t gpu_flag =

View File

@@ -1,29 +1,7 @@
#include "shm_worker.h"
static inline uint16_t get_shm_priority() {
const char *env_priority = getenv("VLLM_VNPU_PRIORITY");
if (env_priority) {
try {
int p = std::stoi(env_priority);
if (p >= 0 && p <= 7) {
return static_cast<uint16_t>(p);
} else {
spdlog::warn("VLLM_VNPU_PRIORITY should be between 0 and 7, got {}. Using default 0.", p);
}
} catch (...) {
spdlog::warn("Invalid VLLM_VNPU_PRIORITY format. Using default 0.");
}
}
return 0;
}
ShmWorker::ShmWorker() {
this->priority = get_shm_priority();
this->waiting_timestamp = 0;
this->is_waiting = false;
this->is_holding_lock = false;
spdlog::info("vNPU worker initialized with priority {}", priority);
std::string shm_name = get_shm_name();
int shm_fd = shm_open(shm_name.c_str(), O_RDWR, 0666);
if (shm_fd == -1) {
@@ -62,18 +40,16 @@ bool ShmWorker::register_worker(int32_t tgid, int gpu_id,
if (slot == -1) {
return false;
}
this->shm_slot = slot;
*out_shareable_handle = shm_helper->vram_info[gpu_id].shareable_handle;
*out_vmem_size = shm_helper->vram_info[gpu_id].total_vmem_size;
stop_heart_beat.store(false, std::memory_order_release);
heart_beat_thread = std::thread(&ShmWorker::heart_beat_loop, this);
heart_beat_thread = std::thread(&ShmWorker::heart_beat_loop, this, slot);
return true;
}
void ShmWorker::heart_beat_loop() {
int slot = this->shm_slot;
void ShmWorker::heart_beat_loop(int slot) {
while (!stop_heart_beat.load(std::memory_order_acquire)) {
// update heart beat
int32_t shm_tgid =
@@ -88,7 +64,6 @@ void ShmWorker::heart_beat_loop() {
spdlog::error("TGID {} failed to re-register as worker", tgid);
throw std::runtime_error("Failed to re-register as worker");
}
this->shm_slot = slot;
}
uint64_t now = heartbeat_ts_us();
shm_helper->heart_beats[slot].timestamp.store(now,
@@ -97,95 +72,32 @@ void ShmWorker::heart_beat_loop() {
}
}
void ShmWorker::start_wait() {
if (is_waiting) return; // Keep the older timestamp if already waiting
// Use lower 24 bits of millisecond timestamp
waiting_timestamp = static_cast<uint32_t>((heartbeat_ts_us() / 1000) & 0xFFFFFF);
uint64_t flag = pack_waiting_flag(this->gpu_id, this->priority, waiting_timestamp, this->tgid);
shm_helper->waiting_worker_flags[this->shm_slot].store(flag, std::memory_order_release);
is_waiting = true;
}
void ShmWorker::cancel_wait() {
if (!is_waiting) return;
shm_helper->waiting_worker_flags[this->shm_slot].store(0, std::memory_order_release);
is_waiting = false;
}
bool ShmWorker::has_higher_priority_waiter() {
for (int i = 0; i < MAX_WORKERS; ++i) {
if (i == this->shm_slot) continue;
uint64_t flag = shm_helper->waiting_worker_flags[i].load(std::memory_order_acquire);
if (flag == 0) continue;
if (unpack_waiting_device_id(flag) != this->gpu_id) continue;
uint16_t other_prio = unpack_waiting_priority(flag);
if (other_prio > this->priority) {
return true; // Found a waiter with higher priority
} else if (other_prio == this->priority) {
if (this->is_holding_lock) {
// doesn't need to yield to same priority waiters
continue;
}
if (!this->is_waiting) {
// an earlier waiter with the same priority
return true;
}
uint32_t other_ts = unpack_waiting_timestamp_ms(flag);
// Same priority, compare timestamps (handle 24-bit wrap-around)
// Using 24-bit unsigned subtraction. If the difference is in the lower half,
// my timestamp is greater (i.e., I started waiting later).
uint32_t diff = (this->waiting_timestamp - other_ts) & 0xFFFFFF;
if (diff > 0 && diff < 0x800000) {
return true; // The other worker started waiting earlier
} else if (diff == 0 && unpack_waiting_tgid(flag) < this->tgid) {
// using tgid if timestamps happen to be exactly the same
return true;
}
}
}
return false;
}
bool ShmWorker::try_lock_gpu(bool &out_self_hold) {
static int retry_cnt = 0;
uint64_t old_flag =
shm_helper->gpu_flag[gpu_id].load(std::memory_order_acquire);
if (unpack_lock_field(old_flag) == 0) { // free
// Check priority: yield if there are higher priority waiters, or same priority waiters who have waited longer.
if (has_higher_priority_waiter()) {
out_self_hold = false;
return false;
}
uint64_t new_flag = pack_locked_tgid(tgid);
if (shm_helper->gpu_flag[gpu_id].compare_exchange_weak(
old_flag, new_flag, std::memory_order_acq_rel,
std::memory_order_acquire)) {
// spdlog::info("TGID {} acquired GPU {} lock", tgid, gpu_id);
spdlog::info("TGID {} acquired GPU {} lock", tgid, gpu_id);
int32_t prev_tgid = unpack_tgid_field(old_flag);
out_self_hold = prev_tgid == tgid;
retry_cnt = 0;
this->is_holding_lock = true;
return true;
}
} else { // locked
if (unpack_tgid_field(old_flag) == tgid) {
// spdlog::info("TGID {} already holds the GPU {} lock", tgid, gpu_id);
spdlog::info("TGID {} already holds the GPU {} lock", tgid, gpu_id);
out_self_hold = true;
retry_cnt = 0;
this->is_holding_lock = true;
return true;
}
}
// failed
if (++retry_cnt % 10000 == 0) {
if (++retry_cnt % 2000 == 0) {
spdlog::info(
"TGID {} trying to acquire GPU {} lock, current lock holder TGID {}",
tgid, gpu_id, unpack_tgid_field(old_flag));
@@ -204,23 +116,19 @@ bool ShmWorker::lock_gpu(bool &out_self_hold) {
}
}
void ShmWorker::unlock_gpu(bool keep_wait) {
if (!keep_wait) {
cancel_wait();
}
void ShmWorker::unlock_gpu() {
uint64_t old_flag =
shm_helper->gpu_flag[gpu_id].load(std::memory_order_acquire);
if (unpack_tgid_field(old_flag) != tgid) {
if (!keep_wait) {
spdlog::info("unlock: TGID {} does not hold GPU {} lock", tgid, gpu_id);
}
// spdlog::warn("previous gpu flag {} does not match expected locked flag for "
// "TGID {}. This may be a bug, unless during startup.",
// old_flag, tgid);
spdlog::info("TGID {} does not hold GPU {} lock", tgid, gpu_id);
} else {
uint64_t new_flag = pack_unlocked_tgid(tgid);
shm_helper->gpu_flag[gpu_id].store(new_flag, std::memory_order_release);
// spdlog::info("TGID {} released GPU {} lock", tgid, gpu_id);
spdlog::info("TGID {} released GPU {} lock", tgid, gpu_id);
}
this->is_holding_lock = false;
}
uint64_t ShmWorker::make_request(uint32_t type, uint64_t parameter) {

View File

@@ -17,21 +17,11 @@ class ShmWorker {
bool try_lock_gpu(bool &out_self_hold);
bool lock_gpu(bool &out_self_hold);
void unlock_gpu(bool keep_wait = false);
bool has_higher_priority_waiter();
void start_wait();
void cancel_wait();
void unlock_gpu();
private:
int32_t tgid;
int gpu_id;
int shm_slot;
uint16_t priority;
uint32_t waiting_timestamp;
bool is_waiting;
bool is_holding_lock;
ShmHelper *shm_helper;
std::thread heart_beat_thread;
std::atomic<bool> stop_heart_beat;
@@ -41,5 +31,5 @@ class ShmWorker {
int register_worker_shm();
// heart beat
void heart_beat_loop();
};
void heart_beat_loop(int slot);
};

View File

@@ -1,29 +1,25 @@
#include <iostream>
#include <sys/types.h>
#include <atomic>
#include <fcntl.h>
#include <mutex>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <vector>
#include <atomic>
#include <mutex>
#include <signal.h>
#include "acl/acl.h"
#include "npu_helper.h"
#include "shm_manager.h"
#include "npu_helper.h"
#include "spdlog/spdlog.h"
static ShmManager *shm_manager = nullptr;
static inline double TO_GB(size_t bytes) {
return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
void handle_signal(int sig) {
if (shm_manager) {
shm_manager->stop_busy_loop();
@@ -53,7 +49,7 @@ size_t get_reserved_vram_size() {
reserved_vram_size = size_gb * 1024 * 1024 * 1024;
} catch (const std::exception &e) {
spdlog::warn("Failed to parse VNPU_RESERVED_VRAM_SIZE_GB: {}, using "
"default 8 GB",
"default 8GB",
e.what());
}
}
@@ -72,13 +68,12 @@ void ensure_context(unsigned long long device) {
}
void init_acl() {
int32_t deviceId = 0;
int32_t deviceId=0;
aclError ret = aclrtSetDevice(deviceId);
if (ret != ACL_ERROR_NONE) {
throw std::runtime_error(
"aclrtSetDevice failed with acl error code: " + std::to_string(ret) +
" " + __FILE__ + ":" + std::to_string(__LINE__));
throw std::runtime_error("aclrtSetDevice failed with acl error code: " +
std::to_string(ret) + " " + __FILE__ + ":" + std::to_string(__LINE__));
}
}
@@ -114,9 +109,8 @@ void alloc_physical(uint32_t device_id, aclrtDrvMemHandle &out_mem_handle,
spdlog::error("aclrtGetMemInfo failed, error_code: {}", error_code);
throw std::runtime_error("aclrtGetMemInfo failed");
} else {
spdlog::info(
"aclrtGetMemInfo succeeded, free_mem: {:.2f} GB, total: {:.2f} GB",
TO_GB(free_mem), TO_GB(total));
spdlog::info("aclrtGetMemInfo succeeded, free_mem: {}, total: {}", free_mem,
total);
}
aclrtPhysicalMemProp prop = {};
@@ -135,14 +129,13 @@ void alloc_physical(uint32_t device_id, aclrtDrvMemHandle &out_mem_handle,
error_code);
throw std::runtime_error("aclrtMemGetAllocationGranularity failed");
} else {
spdlog::info("aclrtMemGetAllocationGranularity succeeded, granularity: {} bytes",
spdlog::info("aclrtMemGetAllocationGranularity succeeded, granularity: {}",
granularity);
}
size_t reserved_mem_size = get_reserved_vram_size();
if (free_mem < reserved_mem_size) {
spdlog::error(
"Not enough free memory to reserve: {:.2f} GB, free_mem: {:.2f} GB",
TO_GB(reserved_mem_size), TO_GB(free_mem));
spdlog::error("Not enough free memory to reserve: {}, free_mem: {}",
reserved_mem_size, free_mem);
throw std::runtime_error("Not enough free memory to reserve");
}
out_g_size = free_mem - reserved_mem_size;
@@ -154,8 +147,8 @@ void alloc_physical(uint32_t device_id, aclrtDrvMemHandle &out_mem_handle,
spdlog::error("aclrtMallocPhysical failed, error_code: {}", error_code);
throw std::runtime_error("aclrtMallocPhysical failed");
} else {
spdlog::info("device {} aclrtMallocPhysical succeeded, size: {:.2f} GB",
device_id, TO_GB(out_g_size));
spdlog::info("device {} aclrtMallocPhysical succeeded, size: {}", device_id,
out_g_size);
}
}

View File

@@ -1,234 +0,0 @@
# Deployment Tutorial Template Based on the XXX Model
This template is based on deployment tutorials for models such as DeepSeek-V3.2 and Qwen-VL-Dense, and is intended to serve as a reference for technical documentation writing. Users can systematically construct relevant technical documentation by following the guidelines provided in this template.
## 1 Introduction
**Content Writing Requirements:**
- Provide a one-sentence description of the model's basic architecture, core features, and primary application scenarios.
- Provide a one-sentence description of the document's purpose and the objectives to be achieved.
- Specify the version of vLLM-Ascend used in the document and the version support status of the model.
**Example 1: Model Introduction**
DeepSeek-V3.2 is a sparse attention model. Its core architecture is similar to that of DeepSeek-V3.1, but it employs a sparse attention mechanism, aiming to explore and validate optimization solutions for training and inference efficiency in long-context scenarios.
**Example 2: Document Purpose**
This document will demonstrate the primary validation steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, as well as accuracy and performance evaluation.
**Example 3: Version Information**
This document is validated and written based on **vLLM-Ascend v0.13.0**. The current model (XXX) is fully supported in this version, and all **v0.13.0 and later versions** can run stably. To use the latest features (e.g., PD separation, MTP), it is recommended to use v0.13.0 or a later version.
## 2 Feature Matrix
This section introduces the features supported by the model, including supported hardware, quantization methods, data parallelism, long-sequence features, etc.
**Content Writing Requirements:**
- Present the support status of models and features in a table format.
- Alternatively, provide references with hyperlinks.
**Example 1: Feature Support List**
| Model Name | Support Status | Remarks | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Caching | LoRA | Speculative Decoding | Asynchronous Scheduling | Tensor Parallelism | Pipeline Parallelism | Expert Parallelism | Data Parallelism | Prefill-Decode Separation | Segmented ACL Graph Execution | Full ACL Graph Execution | Max Model Length | MLP Weight Prefetch | Documentation |
| ------ | ---------- | ------ | ------ | ---------- | ------ | ------------ | -------------- | ------ | ---------- | ---------- | ---------- | ------------ | ---------- | ---------- | ------------------- | ----------- | ----------- | ------------- | ------------- | ---------- |
| DeepSeek V3/3.1 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k | | [DeepSeek-V3.1](../../tutorials/models/DeepSeek-V3.1.md) |
| DeepSeek V3.2 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/models/DeepSeek-V3.2.md) |
| DeepSeek R1 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k | | [DeepSeek R1](../../tutorials/models/DeepSeek-R1.md) |
| Qwen3 | ✅ | | ✅ | Atlas 800I A2:<br>Minimum card requirement: xx | ✅ | ✅ | ✅ | | | ✅ | ✅ | | | ✅ | | ✅ | ✅ | 128k | ✅ | [Qwen3](../../tutorials/models/Qwen3-Dense.md) |
**Note**: This is a simplified example. Please refer to the complete feature matrix for the full table.
**Example 2: Reference Citation**
Please refer to the [Supported Features List](../user_guide/support_matrix/supported_models.md) for the model support matrix.
Please refer to the [Feature Guide](../user_guide/feature_guide/index.md) for feature configuration information.
## 3 Environment Preparation
### 3.1 Model Weight
**Content Writing Requirements:** Describe the hardware resources, software environment, and model files required for deployment.
**Example:**
| Model Version | Hardware Requirements | Download Link |
| ---------- | ---------- | ---------- |
| DeepSeek-V3.2-Exp (BF16) | 2×Atlas 800 A3 (64G×16)<br>4×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16) |
| DeepSeek-V3.2-Exp-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)<br>2×Atlas 800 A2 (64G×8) | [Model Weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8) |
| DeepSeek-V3.2-w8a8 (Quantized) | 1×Atlas 800 A3 (64G×16)<br>2×Atlas 800 A2 (64G×8) | [Model Weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/) |
### 3.2 Verify Multi-node Communication (Optional)
**Example:**
If multi-node deployment is required, please follow the [Verify Multi-node Communication Environment](../installation.md#verify-multi-node-communication) guide for communication verification.
## 4 Installation
**Content Writing Requirements:**
- Provide specific steps and startup commands, covering both single-node and multi-node configurations.
- Provide explanations for parameters, including meaning, value range, and units.
- Specify the basic environment variables and communication environment variables that need to be enabled, with explanations including meaning, value range, and units.
- If the code example includes version numbers, it is necessary to add a comment explaining that the version number should be filled in according to the actual version in use.
### 4.1 Docker Image Installation
**Example:** Omitted
### 4.2 Source Code Installation
**Example:** Omitted
## 5 Online Service Deployment
### 5.1 Single-Node Online Deployment
**Content Writing Requirements:**
- Describe the architectural characteristics and applicable scenarios of single-node deployment.
- Provide startup command templates and key parameter descriptions.
- Provide service verification methods.
**Example:**
Single-node deployment completes both Prefill and Decode within the same node, suitable for XXX scenarios.
Startup Command:
```bash
# Omitted
```
Service Verification:
```bash
# Omitted
```
### 5.2 Multi-Node PD Separation Deployment
**Content Writing Requirements:**
- Describe the principles of PD separation architecture and applicable scenarios.
- List prerequisites (network, storage, permissions).
- Provide script frameworks and key configuration item descriptions.
- Specify node role division and startup procedures.
- Indicate performance metrics.
**Example:** Omitted
### 5.3 Special Deployment Modes (Optional)
**Content Writing Requirements:**
- If the model features non-standard deployment modes (e.g., offline batch processing for embedding models, low-latency online serving for reranker models), the corresponding deployment solutions must be explicitly documented.
- Section 5 "Online Service Deployment" provides examples for single-node online service deployment and multi-node PD-separated deployment, which can be referenced and extended.
## 6 Functional Verification
**Content Writing Requirements:** Guide users on how to test the basic functionality of the model through simple interface calls after the service is started.
**Example:**
After the service is started, the model can be invoked by sending a prompt:
```shell
curl http://<node0_ip>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek_v3.2",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
```
## 7 Accuracy Evaluation
**Content Writing Requirements:** Introduce standardized methods and tools for evaluating model output quality (accuracy). Two accuracy evaluation methods are provided below as examples; alternatively, provide direct links to existing documentation.
### Using AISBench
For details, please refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md).
### Using Language Model Evaluation Harness
Using the `gsm8k` dataset as an example test dataset, run the accuracy evaluation for `DeepSeek-V3.2-W8A8` in online mode.
1. For `lm_eval` installation, please refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md).
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
## 8 Performance
Omitted. Requirements are the same as for Accuracy Evaluation.
## 9 Best Practices
**Content Writing Requirements:**
For each model, provide recommended configurations that achieve optimal performance in three scenarios (long sequence, low latency, high throughput), but do not provide specific performance data.
## 10 Performance Tuning (Optional)
**Content Writing Requirements:**
- Summarize key optimization techniques and parameter tuning experiences for the model to help users achieve optimal performance in specific scenarios. Include optimization technique descriptions, enablement methods, parameter tuning recommendations, and typical configuration examples.
- Hyperlinks to the features guide may be used to allow users to view detailed descriptions of specific features.
### 10.1 Key Optimization Points
In this section, we will introduce the key optimization points that can significantly improve the performance of the XX model. These techniques aim to improve throughput and efficiency in various scenarios.
#### 10.1.1 Basic Optimizations
**Example:**
The following optimizations are enabled by default and require no additional configuration:
| Optimization Technique | Technical Principle | Performance Benefit |
| --------- | --------- | --------- |
| Rope Optimization | The cos_sin_cache and indexing operations of positional encoding are executed only in the first layer, and subsequent layers reuse them directly | Reduces redundant computation during the decoding phase, accelerating inference |
| AddRMSNormQuant Fusion | Fuses the residual Add, RMSNorm, and quantization operations into a single operator | Optimizes memory access patterns, improving computational efficiency |
| Zero-like Elimination | Removes unnecessary zero-tensor operations in Attention forward pass | Reduces memory footprint, improves matrix operation efficiency |
| FullGraph Optimization | Captures and replays the entire decoding graph at once using `compilation_config={"cudagraph_mode":"FULL_DECODE_ONLY"}` | Significantly reduces scheduling latency, stabilizes multi-device performance |
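For instance, the FullGraph decode capture can also be requested explicitly at startup. The command below is only a sketch; the model name and port are placeholders:

```bash
# Sketch: request full-graph capture for the decode phase at server startup.
# <model> is a placeholder for your model path or name.
vllm serve <model> \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --port 8000
```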
#### 10.1.2 Advanced Optimizations (Require Explicit Enablement)
**Example:**
| Optimization Technique | Technical Principle | Enablement Method | Applicable Scenarios | Precautions |
| --------- | --------- | --------- | --------- | --------- |
| FlashComm_v1 | Decomposes traditional Allreduce into Reduce-Scatter and All-Gather, reducing RMSNorm computation dimensions | `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` | High-concurrency, Tensor Parallelism (TP) scenarios | Threshold protection: Only takes effect when the actual number of tokens exceeds the threshold to avoid performance degradation in low-concurrency scenarios |
| Matmul-ReduceScatter Fusion | Fuses matrix multiplication and Reduce-Scatter operations to achieve pipelined parallel processing | Automatically enabled after enabling FlashComm_v1 | Large-scale distributed environments | Same as FlashComm_v1, has threshold protection |
| Weight Prefetch | Utilizes vector computation time to prefetch MLP weights into L2 cache in advance | `export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` | MLP-intensive scenarios (Dense models) | Requires coordination with prefetch buffer size adjustment |
| Asynchronous Scheduling | Non-blocking task scheduling to improve concurrent processing capability | `--async-scheduling` | Large-scale models, high-concurrency scenarios | Should be used in coordination with FullGraph optimization |
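As a combined sketch (the model name is a placeholder; validate each switch against your own workload and concurrency level before adopting it):

```bash
# Sketch: enable FlashComm_v1 (which also activates the fused
# Matmul-ReduceScatter path) and MLP weight prefetching, then launch
# the server with asynchronous scheduling.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
vllm serve <model> --async-scheduling
```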
### 10.2 Optimization Highlights
**Content Writing Requirements:**
Summarize the most noteworthy optimization points during the actual tuning process, distill core experiences, and provide readers with tuning ideas for getting started quickly.
**Example:**
During the actual tuning process, the following points are most critical for performance improvement: the prefetch buffer size needs to be determined through empirical measurement to find the optimal overlap between computation and prefetching; the setting of `max-num-batched-tokens` needs to balance throughput against device memory to avoid excessive chunking or OOM risk; `cudagraph_capture_sizes` must be manually specified and must cover the target concurrency, and when FlashComm_v1 is enabled the values must also be multiples of the TP size; `pa_shape_list` is a temporary tuning parameter that only takes effect for specific batch sizes, so watch version evolution and adjust it in time. The coordinated configuration of the above parameters and environment variables is key to achieving peak performance.
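A sketch of how these knobs combine on the command line is shown below. All values are placeholders that must be tuned empirically for your model and hardware; `pa_shape_list` is omitted because it is a temporary, version-specific parameter:

```bash
# Sketch: illustrative tuning configuration. With FlashComm_v1 enabled,
# cudagraph_capture_sizes should cover the target concurrency and use
# multiples of the TP size (here TP=4).
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve <model> \
  --tensor-parallel-size 4 \
  --max-num-batched-tokens 8192 \
  --compilation-config '{"cudagraph_capture_sizes": [4, 8, 16, 32]}'
```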
## 11 FAQ
**Content Writing Requirements:**
Provide solutions to common problems, including but not limited to problem phenomenon description, cause analysis, and solution measures.

View File

@@ -2,7 +2,7 @@
## Committers
| Name | GitHub ID | Date |
|:-----------:|:-----:|:-----:|
| Xiyuan Wang | [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |

View File

@@ -14,7 +14,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
- Contributor:
**Responsibility:** Help new contributors onboarding, handle and respond to community questions, review RFCs and code.
**Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in a project, including but not limited to issue/review/commits/community involvement.
@@ -37,7 +37,7 @@ Maintainers will be granted write access to the [vllm-project/vllm-ascend](https
### The Principles
- Membership in vLLM Ascend is given to individuals on a merit basis after they demonstrate their strong expertise in vLLM/vLLM Ascend through contributions, reviews, and discussions.
- For membership in the maintainer group, individuals have to demonstrate strong and continued alignment with the overall vLLM/vLLM Ascend principles.

View File

@@ -10,7 +10,7 @@ Read case studies on how users and developers solve real, everyday problems with
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2). See more GPUStack performance evaluation information at [this link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient, and production-ready RL training library for LLMs. It uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). See more information on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/quick_start/ascend_quick_start.html).
:::{toctree}
:caption: More details

View File

@@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate the model and perform inference after fine-tuning.
**Business challenge**

View File

@@ -21,37 +21,35 @@ For example:
The table below is the release compatibility matrix for vLLM Ascend releases.
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | Triton Ascend | Mooncake |
|-------------|-------------------|-----------------|-------------|---------------------------------|-------------------|----------|
| v0.18.0 | v0.18.0 | >= 3.10, < 3.12 | 8.5.1 | 2.9.0 / 2.9.0.post1+git4c901a4 | 3.2.0.dev20260322 | 3.9.0 |
| v0.18.0rc1 | v0.18.0 | >= 3.10, < 3.12 | 8.5.1 | 2.9.0 / 2.9.0.post1+git4c901a4 | 3.2.0.dev20260322 | 3.8.9 |
| v0.17.0rc1 | v0.17.0 | >= 3.10, < 3.12 | 8.5.1 | 2.9.0 / 2.9.0 | 3.2.0 | |
| v0.16.0rc1 | v0.16.0 | >= 3.10, < 3.12 | 8.5.1 | 2.9.0 / 2.9.0 | 3.2.0 | |
| v0.15.0rc1 | v0.15.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 | |
| v0.14.0rc1 | v0.14.1 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.9.0 | 3.2.0 | |
| v0.13.0 | v0.13.0 | >= 3.10, < 3.12 | 8.5.0 | 2.9.0 / 2.8.0.post2 | 3.2.0 | |
| v0.13.0rc2 | v0.13.0 | >= 3.10, < 3.12 | 8.5.0 | 2.8.0 / 2.8.0.post1 | 3.2.0 | |
| v0.13.0rc1 | v0.13.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 | | |
| v0.12.0rc1 | v0.12.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 | | |
| v0.11.0 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 | | |
| v0.11.0rc3 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 | | |
| v0.11.0rc2 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1 | | |
| v0.11.0rc1 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC1 | 2.7.1 / 2.7.1 | | |
| v0.11.0rc0 | v0.11.0rc3 | >= 3.9, < 3.12 | 8.2.RC1 | 2.7.1 / 2.7.1.dev20250724 | | |
| v0.10.2rc1 | v0.10.2 | >= 3.9, < 3.12 | 8.2.RC1 | 2.7.1 / 2.7.1.dev20250724 | | |
| v0.10.1rc1 | v0.10.1/v0.10.1.1 | >= 3.9, < 3.12 | 8.2.RC1 | 2.7.1 / 2.7.1.dev20250724 | | |
| v0.10.0rc1 | v0.10.0 | >= 3.9, < 3.12 | 8.2.RC1 | 2.7.1 / 2.7.1.dev20250724 | | |
| v0.9.2rc1 | v0.9.2 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250619 | | |
| v0.9.1 | v0.9.1 | >= 3.9, < 3.12 | 8.2.RC1 | 2.5.1 / 2.5.1.post1 | | |
| v0.9.1rc3 | v0.9.1 | >= 3.9, < 3.12 | 8.2.RC1 | 2.5.1 / 2.5.1.post1 | | |
| v0.9.1rc2 | v0.9.1 | >= 3.9, < 3.12 | 8.2.RC1 | 2.5.1 / 2.5.1.post1 | | |
| v0.9.1rc1 | v0.9.1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1.post1.dev20250528 | | |
| v0.9.0rc2 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | | |
| v0.9.0rc1 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | | |
| v0.8.5rc1 | v0.8.5.post1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | | |
| v0.8.4rc2 | v0.8.4 | >= 3.9, < 3.12 | 8.0.0 | 2.5.1 / 2.5.1 | | |
| v0.7.3.post1| v0.7.3 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | | |
| v0.7.3 | v0.7.3 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | | |
:::{note}
If you're using v0.7.3, don't forget to install [mindie-turbo](https://pypi.org/project/mindie-turbo) as well.
@@ -176,4 +174,4 @@ Notes:
- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in any vLLM Ascend version or branch.

View File

@@ -34,13 +34,14 @@ bash format.sh
#### Run CI locally
After completing "Run lint" setup, you can run CI (Continuous integration) locally:
```{code-block} bash
:substitutions:
cd ~/vllm-project/
# Run CI needs vLLM installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
@@ -51,7 +52,7 @@ cd ..
cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non-Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
@@ -74,7 +75,7 @@ You can refer to [Testing](./testing.md) to set up a testing environment and ru
## DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO (Developer Certificate of Origin).
Using `-s` with `git commit` will automatically add this header.
@@ -101,7 +102,7 @@ If the PR spans more than one category, please include all relevant prefixes.
## Others
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing).
If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.
:::{toctree}

View File

@@ -20,7 +20,7 @@ From the workflow perspective, we can see how the final test script is executed,
2. Add config yaml
As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106) shows, a k8s pod startup means traversing all *.yaml files in the [directory](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/) and executing them according to their configurations, so all we need to do is add a YAML file like [DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml).
Suppose you have **2 nodes** running a 1P1D setup (1 Prefiller + 1 Decoder):
@@ -35,75 +35,77 @@ From the workflow perspective, we can see how the final test script is executed,
npu_per_node: 16
# All env vars you need should add it here
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
disaggregated_prefill:
  enabled: true
  # node index(a list) which meet all the conditions:
  # - prefiller
  # - no headless(have api server)
  prefiller_host_index: [0]
  # node index(a list) which meet all the conditions:
  # - decoder
  decoder_host_index: [1]
# Add each node's vllm serve cli command just like you run locally
# Add each node's individual envs like follow
deployment:
  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
  - envs:
      # fill with envs like: <key>:<value>
    server_cmd: >
      vllm serve ...
benchmarks:
  perf:
    # fill with performance test kwargs
  acc:
    # fill with accuracy test kwargs
```
3. Add the case to nightly workflow
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)
```yaml
multi-node-tests:
  name: multi-node
  if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
  strategy:
    fail-fast: false
    max-parallel: 1
    matrix:
      test_config:
        - name: multi-node-deepseek-pd
          config_file_path: DeepSeek-V3.yaml
          size: 2
        - name: multi-node-qwen3-dp
          config_file_path: Qwen3-235B-A22B.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node
          config_file_path: Qwen3-235B-W8A8.yaml
          size: 2
        - name: multi-node-qwenw8a8-2node-eplb
          config_file_path: Qwen3-235B-W8A8-EPLB.yaml
          size: 2
  uses: ./.github/workflows/_e2e_nightly_multi_node.yaml
  with:
    soc_version: a3
    runner: linux-aarch64-a3-0
    image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3'
    replicas: 1
    size: ${{ matrix.test_config.size }}
    config_file_path: ${{ matrix.test_config.config_file_path }}
  secrets:
    KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}
```
The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file. The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2.
@@ -111,142 +113,142 @@ The matrix above defines all the parameters required to add a multi-machine use
### 1. Use kubernetes
This section assumes that you already have a [Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment locally. Then you can easily start our test with one click.
- Step 1. Install LWS CRD resources
See <https://lws.sigs.k8s.io/docs/installation/>, which can be used as a reference.
- Step 2. Deploy the following yaml file `lws.yaml` as needed
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: test-server
  namespace: vllm-project
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: None
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
                cpu: 125
            ports:
              - containerPort: 8080
            # readinessProbe:
            #   tcpSocket:
            #     port: 8080
            #   initialDelaySeconds: 15
            #   periodSeconds: 10
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            imagePullPolicy: Always
            image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3
            env:
              - name: CONFIG_YAML_PATH
                value: DeepSeek-V3.yaml
              - name: WORKSPACE
                value: "/vllm-workspace"
              - name: FAIL_TAG
                value: FAIL_TAG
            command:
              - sh
              - -c
              - |
                bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh
            resources:
              limits:
                huawei.com/ascend-1980: 16
                memory: 512Gi
                ephemeral-storage: 100Gi
              requests:
                huawei.com/ascend-1980: 16
                ephemeral-storage: 100Gi
                cpu: 125
            volumeMounts:
              - mountPath: /root/.cache
                name: shared-volume
              - mountPath: /usr/local/Ascend/driver/tools
                name: driver-tools
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
          - name: shared-volume
            persistentVolumeClaim:
              claimName: nv-action-vllm-benchmarks-v2
          - name: driver-tools
            hostPath:
              path: /usr/local/Ascend/driver/tools
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
  namespace: vllm-project
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
```bash

View File

@@ -23,7 +23,7 @@ cd ~/vllm-project/
# vllm vllm-ascend
# Use mirror to speed up download
# docker pull m.daocloud.io/quay.io/ascend/cann:|cann_image_tag|
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm --name vllm-ascend-ut \
-v $(pwd):/vllm-project \
@@ -40,7 +40,6 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror
# src path
export SRC_WORKSPACE=/vllm-workspace
mkdir -p $SRC_WORKSPACE
cd $SRC_WORKSPACE
apt-get update -y
apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2
@@ -154,8 +153,8 @@ pip install -r requirements-dev.txt
There are several principles to follow when writing unit tests:
- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker.py` --> `tests/ut/worker/test_worker.py`
- The vLLM Ascend test uses unittest framework. See [the Python unittest documentation](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock the device-related functions on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
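For instance, the mocking requirement above can be satisfied with a pattern like the following minimal sketch; `get_device_count` is a local placeholder, not a real vllm_ascend function, and the test passes on a CPU-only host when run with `pytest` or `python -m unittest`:

```python
# Minimal sketch of the CPU-only mocking pattern used in unit tests.
import unittest
from unittest import mock


def get_device_count() -> int:
    """Placeholder for a device-related helper that would need an NPU."""
    raise RuntimeError("requires an NPU device")


class TestDeviceMocking(unittest.TestCase):
    @mock.patch(f"{__name__}.get_device_count", return_value=8)
    def test_runs_without_npu(self, _mock_count):
        # The patched function never touches the device, so this passes on CPU.
        self.assertEqual(get_device_count(), 8)


if __name__ == "__main__":
    unittest.main()
```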
@@ -206,8 +205,8 @@ pytest -sv tests/ut/test_ascend_config.py
### E2E test
Although vllm-ascend CI provides E2E tests on Ascend CI (for example,
[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test_full.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can run them locally.
:::::{tab-set}
:sync-group: e2e

View File

@@ -38,11 +38,11 @@ Run the vLLM server in the docker.
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 &
```
:::{note}
`--max-model-len` should be greater than `35000`, which will be suitable for most datasets. Otherwise the accuracy evaluation may be affected.
:::
The vLLM server has started successfully if you see logs like the following:
@@ -81,74 +81,74 @@ You can choose one or multiple datasets to execute accuracy evaluation.
1. `C-Eval` dataset.
Take the `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
Download the dataset and install it to the specified path.
```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
2. `MMLU` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
unzip mmlu.zip
rm mmlu.zip
```
3. `GPQA` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
unzip gpqa.zip
rm gpqa.zip
```
4. `MATH` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
unzip math.zip
rm math.zip
```
5. `LiveCodeBench` dataset.
```shell
cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
```
6. `AIME 2024` dataset.
```shell
cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip
```
7. `GSM8K` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip
```
#### Configuration
@@ -161,7 +161,7 @@ There are several arguments that you should update according to your environment
- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server ip and port.
- `max_out_len`: Note that `max_out_len` + LLM input length should be less than `max_model_len` (configured in your vLLM server); `32768` will be suitable for most datasets.
- `batch_size`: Update according to your dataset.
- `temperature`: Update inference argument.

View File

@@ -29,7 +29,7 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
If the vLLM server is started successfully, you can see information shown below:
@@ -65,7 +65,7 @@ pip install gradio plotly evalscope
## 3. Run GSM8K using EvalScope for accuracy testing
You can use `evalscope eval` to run GSM8K (a grade-school math benchmark dataset) for accuracy testing:
```shell
evalscope eval \
@@ -87,7 +87,7 @@ After 1 to 2 minutes, the output is shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more details in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope

View File

@@ -32,7 +32,7 @@ docker run --rm \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 &
```
The vLLM server has started successfully if you see logs like the following:
@@ -43,41 +43,33 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run GSM8K using the vLLM server (curl) and then run lm-eval for accuracy testing
You can query the result with input prompts:
```shell
PROMPT='<|im_start|>system
You are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>
<|im_start|>user
Question: A company'"'"'s balance sheet as of December 31, 2023 shows:
Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan
Non-current assets: Net fixed assets 12 million yuan
Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan
Non-current liabilities: Long-term loans 9 million yuan
Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ?
Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places).
Options:
A. Asset-Liability Ratio=58.33%, Current Ratio=1.90
B. Asset-Liability Ratio=62.50%, Current Ratio=2.17
C. Asset-Liability Ratio=65.22%, Current Ratio=1.75
D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>
<|im_start|>assistant
'
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "Qwen/Qwen2.5-0.5B-Instruct" \
--arg prompt "$PROMPT" \
'{
model: $model,
prompt: $prompt,
max_completion_tokens: 1,
temperature: 0,
stop: ["<|im_end|>"]
}')" | python3 -m json.tool
```
The output format matches the following:
@@ -222,7 +214,7 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
## Use Offline Datasets
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [using-local-datasets][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets

View File

@@ -4,7 +4,7 @@ This document guides you to conduct accuracy testing using [OpenCompass](https:/
## 1. Online Server
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
@@ -29,7 +29,7 @@ docker run --rm \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240
```
The vLLM server has started successfully if you see information like the following:
@@ -53,7 +53,7 @@ curl http://localhost:8000/v1/completions \
}'
```
## 2. Run C-Eval (a Chinese language model evaluation benchmark) using OpenCompass for accuracy testing
Install OpenCompass and configure the environment variables in the container:
@@ -116,7 +116,7 @@ python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
After 1 to 2 minutes, the output is shown below:
```shell
The markdown format results are as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|

View File

@@ -25,7 +25,7 @@ device: | run op1 | run op2 | run op3 | run op4 | run op5 |
## How to use ACL Graph?
ACL Graph is enabled by default in the V1 Engine; you just need to check that `enforce_eager` is not set to `True`. For more details, see the [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html).
## How it works?

View File

@@ -4,44 +4,44 @@
Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
However, the performance gain from prefix caching is highly dependent on the cache hit rate, while the cache hit rate can be limited if one only uses on-chip memory for KV cache storage.
Hence, KV Cache Pool is proposed to utilize various types of storage including on-chip memory, DRAM, and SSD, making a pool for KV Cache storage while making the prefix of requests visible across all nodes, increasing the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV Cache storage engines.
While one can utilize MooncakeStore in vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it would be better to integrate a connector that directly supports MooncakeStore and can utilize the data transfer strategy that best fits Huawei NPU hardware.
Hence, we propose to integrate MooncakeStore with a brand new **MooncakeStoreConnectorV1**, which is indeed largely inspired by **LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 Implemented?` section).
## Usage
vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable MooncakeStore, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV Connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
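As a minimal sketch, enabling the connector might look like the following; the JSON fields are illustrative, and the authoritative schema is in the KV Pool User Guide linked above:

```bash
# Sketch: choose MooncakeStoreConnector as the KV connector.
# <model> is a placeholder; field names and values may differ across versions.
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}'
```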
## How it works?
The KV Cache Pool integrates multiple memory tiers (on-chip memory, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in on-chip memory) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with on-chip memory Prefix Caching
Prefix Caching with on-chip memory is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine on-chip memory-based Prefix Caching with Mooncake-backed KV Pool.
The user can enable both features simply by enabling Prefix Caching, which is enabled by default in vLLM V1 unless the `--no-enable-prefix-caching` flag is set, and setting up the KV Connector for KV Pool (e.g., the MooncakeStoreConnector).
**Workflow**:
1. The engine first checks for prefix hits in the on-chip memory cache.
2. After getting the number of hit tokens in on-chip memory, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool, and get the rest of the blocks directly from on-chip memory to minimize the data transfer latency.
3. After the KV Caches in the KV Pool are loaded into on-chip memory, the remaining process is the same as Prefix Caching in on-chip memory.
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
@@ -49,9 +49,9 @@ When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cach
Currently, we only perform put and get operations of KV Pool for **Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P KV Connector, i.e., MooncakeConnector.
The key benefit of doing this is that we can keep the gain in performance by computing less with Prefix Caching from on-chip memory and KV Pool for Prefill Nodes, while not sacrificing the data transfer efficiency between Prefill and Decode nodes with P2P KV Connector that transfers KV Caches between NPU devices directly.
To enable this feature, we need to set up both Mooncake Connector and MooncakeStore Connector with a Multi Connector, which is a KV Connector class provided by vLLM that can call multiple KV Connectors in a specific order.
For details, please also refer to the Mooncake Connector Store Deployment Guide.
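A hypothetical sketch of such a MultiConnector setup for a Prefill node is shown below; the connector names, roles, and nesting are illustrative and should be checked against the deployment guide:

```bash
# Sketch: chain the P2P MooncakeConnector and the MooncakeStoreConnector
# through vLLM's MultiConnector on a prefill node.
vllm serve <model> --kv-transfer-config '{
  "kv_connector": "MultiConnector",
  "kv_role": "kv_producer",
  "kv_connector_extra_config": {
    "connectors": [
      {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
      {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"}
    ]
  }
}'
```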
@@ -86,6 +86,6 @@ The KV Connector methods that need to be implemented can be categorized into sch
## Limitations
1. Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the storage for KV Cache pool.
2. For now, if a key lookup succeeds but the subsequent get call to the KV Pool fails, we just output a log indicating that the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-computing everything assuming there is no prefix cache hit (or, even better, reverting only one block and keeping the Prefix Caches before that).

View File

@@ -35,7 +35,7 @@ The workflow of obtaining inputs:
Finally, these `Token IDs` are required to be fed into the model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are inputs of the model.
**Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
### 2. Build inputs attention metadata
@@ -217,19 +217,19 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
Current **Token IDs table**:
```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```
**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
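The `token indices` above can be reproduced from the `request indices` and `token positions`: each request owns one 12-slot row of the Token IDs table, so the flat index is `position + request_index * row_width`. A small sketch follows; the formula is inferred from this example, and `row_width` corresponds to the table width here:

```python
# Sketch: derive flat token indices into the Token IDs table.
import numpy as np

row_width = 12  # number of slots per request row in the table above
req_indices = np.array([0, 1, 2, 2, 2])
positions = np.array([3, 2, 5, 6, 7])

token_indices = positions + req_indices * row_width
print(token_indices.tolist())  # [3, 14, 29, 30, 31]
```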
@@ -281,6 +281,6 @@ Scheduled token count: `[1, 1, 3]`
## At last
If you understand step 1 and step 2, you will know all the following steps.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute.

View File

@@ -16,10 +16,10 @@ enable_custom_op()
## How to add a custom aclnn operation?
- Create a new operation folder under `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS="op1;op2;op3"`.
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
After a successful build of vllm-ascend, the custom aclnn operation can be invoked in python code.
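As an invocation sketch (`my_custom_op` is a placeholder for an operator you bound in `csrc/torch_binding.cpp`, and the `enable_custom_op` import path is an assumption based on the snippet at the top of this document):

```python
# Sketch: call a custom aclnn operator after building vllm-ascend.
import torch
from vllm_ascend.utils import enable_custom_op  # import path assumed

enable_custom_op()  # loads and registers the custom op library
x = torch.randn(4, 8)
out = torch.ops._C_ascend.my_custom_op(x)  # placeholder op name
```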

View File

@@ -20,7 +20,7 @@ On multisocket ARM systems, the OS scheduler may place vLLM threads on CPUs f
| Device type | Default mode | Description |
| ----------- | ------------ | ------------ |
| A3 (No Affinity) | `global_slice` | Splits the allowed CPU list evenly based on the **total number of global logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU cores. This prevents CPU core overlap across multiple process groups. |
| A2 / Atlas 300 inference products / Others | `topo_affinity` | Allocates CPUs based on NPU topology affinity (`npu-smi info -t topo`). If multiple NPUs are assigned to a single NUMA node (which may cause bandwidth contention), the CPU allocation extends to adjacent NUMA nodes. |
- **Default**: enabled (enable_cpu_binding = true).
- **Fallback**: If NPU topo affinity is unavailable, global_slice is used.
@@ -36,7 +36,7 @@ On multisocket ARM systems, the OS scheduler may place vLLM threads on CPUs f
- Read cpuset from /proc/self/status.
- Read topo affinity from `npu-smi info -t topo`.
4. **Build CPU pools**:
- Use **global_slice** for A3 devices; **topo_affinity** for A2 and Atlas 300 inference products.
- If topo affinity is missing, fall back to global_slice.
- Ensure each NPU has at least 5 CPUs.
5. **Allocate per-role CPUs**:
@@ -156,7 +156,7 @@ With the current `global_slice` strategy, some CPU/NPU layouts cannot avoid cros
|2|10-14|`IRQ`: 10-11, `Main`: 12, `ACL`: 13, `Release`: 14|
|3|15-19|`IRQ`: 15-16, `Main`: 17, `ACL`: 18, `Release`: 19|
### Example 5: A2/Atlas 300 inference products topo_affinity with NUMA extension
**Inputs**:
@@ -201,7 +201,7 @@ To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has
- Logs show the selected binding mode and the allocation plan, for example:
- `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
- `The CPU allocation plan is as follows: ...`
- You can verify affinity via taskset or `/proc/<pid>/status` after startup.
## Limitations & Notes

View File

@@ -8,7 +8,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
2. **Optimizing TPOT**
Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
---
@@ -20,7 +20,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
---
@@ -28,7 +28,7 @@ For step-by-step deployment and configuration, refer to the following guide:
### 1. Design Approach
Under the disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
### 2. Implementation Design
@@ -38,19 +38,19 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
#### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
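To make step 2 concrete, a hedged sketch of the parameters the Proxy attaches when forwarding to the P node is shown below; the payload layout and the prompt are illustrative assumptions, with only the field names and values taken from the flow above.
```python
# Illustrative forwarded request in the pull (MooncakeConnector) flow.
# Field placement around kv_transfer_params is an assumption, not the exact proxy payload.
prefill_request = {
    "prompt": "What is the capital of France?",  # hypothetical user prompt
    "max_completion_tokens": 1,  # the P node only needs to produce the first token
    "min_tokens": 1,
    "kv_transfer_params": {
        "do_remote_decode": True,  # the KV cache will be pulled by a D node
    },
}
```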
#### Mooncake Layerwise Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result.
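Analogously, here is a hedged sketch of the decode-first forwarding in step 2 of the push flow; the endpoint URL and payload layout are hypothetical.
```python
# Illustrative forwarded request in the push (MooncakeLayerwiseConnector) flow.
decode_request = {
    "prompt": "What is the capital of France?",  # hypothetical user prompt
    "kv_transfer_params": {
        "do_remote_prefill": True,  # the D node waits for the remote KV cache
        "metaserver": "http://proxy-host:8000/metaserver",  # hypothetical endpoint
    },
}
```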
### 3. Interface Design
@@ -63,7 +63,7 @@ Taking MooncakeConnector as an example, the system is organized into three prima
### 4. Specifications Design
This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving equal TP setups and certain unequal TP setups across multiple P and D nodes.
This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes.
| Feature | Status |
|-------------------------------|----------------|

View File

@@ -100,28 +100,28 @@ If you want to add a new eplb policy to vllm_ascend, you must follow these steps
1. Inherit the `EplbPolicy` abstract class in `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return type (`newplacement`) consistent.
For example:
```python
import copy

# Assumed import path based on the description above; adjust to the real module layout.
from .policy_abstract import DynamicConfig, EplbPolicy


class RandomLoadBalance(EplbPolicy):
    def __init__(self, config: DynamicConfig):
        super().__init__(config)

    def rebalance_experts(self, current_expert_table, expert_workload):
        new_table = copy.deepcopy(current_expert_table)
        num_layers = len(current_expert_table)

        for i in range(num_layers):
            # Randomly choose two cards.
            # indices = random.sample(range(num_card), 2)
            indices = [3, 1]

            # Swap the redundant experts between the two cards.
            expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
            new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
            new_table[i][indices[1]][-1] = expert_id_to_exchange

        return 1, [-i for i in range(num_layers)], new_table
```
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
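A minimal sketch of this registration step is shown below; the factory layout and the integer policy-type keys are assumptions, so mirror the existing entries in `policy_factory.py`.
```python
# Hypothetical sketch of registering a new policy in PolicyFactory.
class EplbPolicy:  # stand-in for the abstract base in policy_abstract.py
    def __init__(self, config):
        self.config = config


class RandomLoadBalance(EplbPolicy):  # the policy implemented in step 1
    pass


class PolicyFactory:
    # Map each policy type to its implementation class.
    _registry = {
        1: RandomLoadBalance,  # new entry for the new algorithm
    }

    @classmethod
    def generate_policy(cls, policy_type: int, config) -> EplbPolicy:
        return cls._registry[policy_type](config)


policy = PolicyFactory.generate_policy(1, config=None)
```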
@@ -236,7 +236,7 @@ All method arguments must specify parameter types and default values, and functi
#### Expert Map
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified of which ranks have inconsistent maps.
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
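A hedged sketch of such an initialization-time consistency check, assuming an already initialized `torch.distributed` process group and one expert-map tensor per rank:
```python
# Sketch: gather every rank's expert map and report ranks that disagree with rank 0.
import torch
import torch.distributed as dist


def check_expert_map_consistency(expert_map: torch.Tensor) -> list:
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(expert_map) for _ in range(world_size)]
    dist.all_gather(gathered, expert_map)
    reference = gathered[0]
    # Ranks whose map differs from rank 0's copy should be reported to the user.
    return [r for r, m in enumerate(gathered) if not torch.equal(m, reference)]
```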
#### Expert Weight

View File

@@ -1,9 +1,9 @@
# Design Documents
# Feature Guide
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Design Documents
:caption: Feature Guide
:maxdepth: 1
patch
cpu_binding

View File

@@ -1,101 +1,101 @@
# Npugraph_ex
## How Does It Work?
This is an optimization based on FX graphs, which can be considered an acceleration solution for the aclgraph mode.
You can get its code [here](https://gitcode.com/Ascend/torchair).
## Default FX Graph Optimization
### FX graph pass
- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a combination of non-in-place operators plus copy operators. npugraph_ex reverses this process, restoring the in-place operators and reducing memory movement.
### FX fusion pass
npugraph_ex currently provides three default operator fusion passes, and more will be added in the future.
Operator combinations that meet the replacement rules can be replaced with the corresponding fused operators.
You can find the default fusion pass list [here](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html).
## Custom fusion pass
Users can register a custom graph fusion pass in TorchAir to modify PyTorch FX graphs. The registration relies on the `register_replacement` API.
Below are the declaration of this API and a demo of its usage.
```python
register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)
```
| Parameter Name | Input/Output | Explanation | Required |
|--|--|--|--|
| search_fn | Input | The operator combination or calculation logic that you want to recognize in the FX graph, such as the operator combination that needs to be fused. | Yes |
| replace_fn | Input | When the combination corresponding to search_fn is found in the target graph, this function's computation logic replaces the original subgraph to achieve operator fusion or optimization. | Yes |
| example_inputs | Input | Example input tensors used to trace search_fn and replace_fn. The shapes and dtypes of the inputs should match the actual scenario. | Yes |
| trace_fn | Input | By default, only the forward computation graph is traced, which is suitable for optimization during the inference phase; if training scenarios need to be supported, a function that supports backward tracing can be provided. | No |
| extra_check | Input | An extra verification function run after a pattern is matched. Its input parameter must be a Match object from torch._inductor.pattern_matcher, and it is used for further custom checks on the matching result, such as whether the fused operators are on the same stream, the device type, the input shapes, and so on. | No |
| search_fn_pattern | Input | A custom pattern object, which is generally unnecessary. Its definition follows the rules of the native PyTorch MultiOutputPattern object. When this parameter is passed, search_fn is no longer used to match operator combinations; this parameter is used directly as the matching rule. | No |
Usage example:
```python
import functools

import torch
import torch_npu
import torchair
from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode

# Assume we fuse the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator.

# Define a search_fn to find the operator combination in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, i.e. the fused operator, used to replace the operator combination in the FX graph.
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(x1, x2, gamma)
    return y, xOut

# extra_check carries additional validation logic. Here it checks whether the last dimension
# of the first input x1 is a specific value; if it is not, fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")
    if x1 is None:
        return False
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False
    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168

# Define some sample inputs to trace search_fn and replace_fn into an FX graph.
fake_mode = FakeTensorMode()
with fake_mode:
    # Sizes/values don't actually matter for the initial trace; once we get a possible
    # match we re-trace with the actual values and verify the match still holds.
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

# Call the torchair.register_replacement API with search_fn, replace_fn, and example_inputs.
# Additional validations can be passed in via extra_check.
torchair.register_replacement(
    search_fn=search_fn,
    replace_fn=replace_fn,
    example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
    extra_check=extra_check,
)
```
The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascend and npugraph_ex code repositories.
### DFX
By reusing the TORCH_COMPILE_DEBUG environment variable from the PyTorch community, setting TORCH_COMPILE_DEBUG=1 makes it output the FX graphs throughout the entire process.

View File

@@ -60,7 +60,7 @@ Before writing a patch, following the principle above, we should patch the least
# 1. `<The target patch module in vLLM>`
# Why:
# <Describe the reason why we need to patch>
# How:
# How
# <Describe the way to patch>
# Related PR (if no, explain why):
# <Add a link to the related PR in vLLM. If there is no related PR, explain why>
@@ -72,5 +72,5 @@ Before writing a patch, following the principle above, we should patch the least
## Limitations
1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Now vLLM Ascend can only patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for that version (e.g., v0.9.n) should work.
1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only can patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work.

View File

@@ -24,7 +24,7 @@ The `embedding` method is generally not implemented for quantization, focusing o
The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**).
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **moe**).
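For orientation, a simplified skeleton of such a scheme class is sketched below; the method signatures are assumptions, so check the base classes under `vllm_ascend/quantization/methods` for the exact interfaces.
```python
# Simplified, illustrative skeleton of a linear quantization scheme.
import torch


class ExampleW8A8LinearMethod:
    def create_weights(self, layer: torch.nn.Module) -> None:
        # Register the quantized weight, scale, and offset parameters on the layer.
        ...

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Post-process loaded weights: transpose, convert format or dtype, etc.
        ...

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        # Quantize the activation and run the quantized matrix multiplication.
        ...
```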
**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
@@ -42,7 +42,7 @@ We need to implement the `create_weights`, `process_weights_after_loading`, and
"model.layers.0.mlp.gate.weight": "FLOAT",
"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC",
}
```
@@ -54,7 +54,7 @@ Based on the above content, we present a brief description of the adaptation pro
- **Step 2: Registration**. Use the `@register_scheme` decorator in `vllm_ascend/quantization/methods/registry.py` to register your quantization scheme class.
```python
from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme, AscendMoEScheme
from vllm_ascend.quantization.methods import register_scheme, AscendLinearScheme
@register_scheme("W4A8_DYNAMIC", "linear")
class AscendW4A8DynamicLinearMethod(AscendLinearScheme):
@@ -107,7 +107,7 @@ vLLM Ascend supports multiple quantization algorithms. The following table provi
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | We support two deployment modes: PD Colocation (dynamic quantization for both P and D) and PD Disaggregation (dynamic-quant P and static-quant D) |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | PD Colocation Scenario uses dynamic quantization for both P node and D node; PD Disaggregation Scenario uses dynamic quantization for P node and static for D node |
**Static vs Dynamic:** Static quantization uses pre-computed scaling factors with better performance, while dynamic quantization computes scaling factors on-the-fly for each token/activation tensor with higher precision.

View File

@@ -332,6 +332,7 @@ An L0 `dump.json` contains forward I/O for modules together with parameters. Usi
"data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt"
}
}
},
}
}
}
@@ -388,6 +389,7 @@ An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` function
"data_name": "Functional.relu.0.forward.output.0.pt"
}
]
},
}
}
}

View File

@@ -55,7 +55,7 @@ pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pytest
# Configure this var to speed up model download
export VLLM_USE_MODELSCOPE=True
VLLM_USE_MODELSCOPE=true
```
Please follow the [Installation Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) to make sure vLLM and vllm-ascend are installed correctly.
@@ -85,7 +85,7 @@ wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz
# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.tar.gz -C /usr/local/
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
@@ -111,12 +111,12 @@ sudo apt update
sudo apt install libjemalloc2
# Configure jemalloc
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
```
#### 2.2. Tcmalloc
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
@@ -158,12 +158,6 @@ Scheduling optimization:
:substitutions:
# Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight.
export TASK_QUEUE_ENABLE=2
```
or
```{code-block} bash
:substitutions:
# This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model.
export CPU_AFFINITY_CONF=1
@@ -184,10 +178,10 @@ export HCCL_OP_EXPANSION_MODE="AIV"
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
### 5. OS Optimization

View File

@@ -4,12 +4,6 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
**Legend Description**:
- ✅ = Supported
- 🟡 = Partial / Work in progress
- 🚧 = Under development
## 1. Run docker container
```{code-block} bash
@@ -97,8 +91,7 @@ For local `dataset-path`, please set `hf-name` to its Hugging Face ID like
First start serving your model:
```bash
export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen3-8B
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
```
Then run the benchmarking script:
@@ -159,7 +152,7 @@ vllm bench throughput \
If successful, you will see the following output
```shell
Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 toks/s]
Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280
@@ -223,7 +216,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
```shell
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
export VLLM_USE_MODELSCOPE=true
vllm bench serve \
--model Qwen/Qwen3-Embedding-8B \
--backend openai-embeddings \

View File

@@ -88,7 +88,7 @@ Navigate to the `./vllm_profile` directory and locate the generated `*ascend_pt`
```python
from torch_npu.profiler.profiler import analyse
analyse("./vllm_profile/localhost.localdomain_*_ascend_pt/")
analyse("./vllm_profile/localhost.localdomain_XXXXXXXXXX_ascend_pt/")
```
### 5. View Results

View File

@@ -108,13 +108,17 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 8. Does vllm-ascend support Prefill Disaggregation feature?
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. See the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html) for example.
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. See the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html) for example.
### 9. Does vllm-ascend support quantization method?
Currently, w8a8, w4a8, and w4a4 quantization methods are already supported by vllm-ascend.
### 10. How is vllm-ascend tested?
### 10. How to run a W8A8 DeepSeek model?
Follow the [inference tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 11. How is vllm-ascend tested?
vllm-ascend is tested in three aspects: functions, performance, and accuracy.
@@ -128,25 +132,25 @@ vllm-ascend is tested in three aspects: functions, performance, and accuracy.
For each release, we'll publish the performance test and accuracy test report in the future.
### 11. How to fix the error "InvalidVersion" when using vllm-ascend?
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
The problem is usually caused by the installation of a development or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
### 12. How to handle the out-of-memory issue?
### 13. How to handle the out-of-memory issue?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory).
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited high bandwidth memory (on-chip memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- **Limit `--max-model-len`**: It can save the on-chip memory usage for KV cache initialization step.
- **Limit `--max-model-len`**: It can save the HBM usage for KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization).
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
### 13. Failed to enable NPU graph mode when running DeepSeek
### 14. Failed to enable NPU graph mode when running DeepSeek
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA (Multi-Head Latent Attention) and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, `num_heads`/`num_kv_heads` is {32, 64, 128}.
@@ -155,54 +159,54 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
### 14. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
You may encounter the problem of C/C++ compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
### 15. How to generate deterministic results when using vllm-ascend?
### 16. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output determinism:
1. Sampler method: using **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following environment parameters:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=true
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
### 16. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met, run `pip install qwen-omni-utils`.
This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
### 17. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
```shell
error example in detail:
@@ -218,11 +222,11 @@ Recommended mitigation strategies:
Root cause analysis:
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream-overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 18. How to install custom version of torch_npu?
### 19. How to install custom version of torch_npu?
torch-npu will be overridden when installing vllm-ascend. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after vllm-ascend is installed.
### 19. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
### 20. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
On certain operating systems, such as Kylin OS, you may encounter an `invalid tar header` error during the `docker pull` process:
@@ -249,24 +253,24 @@ This is often due to system compatibility issues. You can resolve this by using
Copy the `vllm_ascend_<tag>.tar` file (where `<tag>` is the image tag you used) to your target machine
### 20. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted"
### 21. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted"
When using `--shm-size`, you may need to add the `--privileged=true` flag to your `docker run` command to grant the container necessary permissions. Please be aware that using `--privileged=true` grants the container extensive privileges on the host system, which can be a security risk. Only use this option if you understand the implications and trust the container's source.
### 21. How to achieve low latency in a small batch scenario?
### 22. How to achieve low latency in a small batch scenario?
The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenarios is not satisfactory, mainly due to the lack of a flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`; you can install it using the following instruction:
```bash
bash tools/install_flash_infer_attention_score_ops_a2.sh
# change to run the following instruction if you're using A3 machine
## change to run the following instruction if you're using A3 machine
# bash tools/install_flash_infer_attention_score_ops_a3.sh
```
**NOTE**: Don't set `additional_config.pa_shape_list` when using this method; otherwise, it will lead to another attention operator.
**Important**: Please make sure you're using the **official image** of `vllm-ascend`; otherwise, you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own, or create one. If you're not the root user, you need `sudo` **privileges** to run this script.
### 22. How to set `SOC_VERSION` when building from source on a CPU-only machine?
### 23. How to set `SOC_VERSION` when building from source on a CPU-only machine?
When building from source (e.g. `pip install -e .`), the build may try to infer the target chip via `npu-smi`. If `npu-smi` is not available (common in CPU-only build environments), you must set `SOC_VERSION` manually before installation.
@@ -286,7 +290,7 @@ export SOC_VERSION="ascend310p1"
export SOC_VERSION="<value starting with ascend950>"
```
### 23. Compilation error occasionally encounters with triton-ascend
### 24. Compilation error occasionally encounters with triton-ascend
As shown in [#7782](https://github.com/vllm-project/vllm-ascend/issues/7782), triton-ascend occasionally encounters compilation errors, which is a known issue in triton-ascend 3.2.0. To avoid this issue, please use the official docker images or install the specific triton-ascend version as follows:
@@ -296,13 +300,3 @@ ARCH=$(python3 -c "import platform; machine = platform.machine().lower(); arch_m
TRITON_ASCEND_WHEEL="triton_ascend-3.2.0.dev20260322-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_27_${ARCH}.manylinux_2_28_${ARCH}.whl" && \
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${TRITON_ASCEND_WHEEL}"
```
### 24. Why TPOT increases drastically as concurrency grows?
When testing a vLLM server, one may find that TPOT increases as concurrency increases (for example, TPOT increases by 0.5 ~ 1ms when concurrency increases by 4). This phenomenon is normal in most cases. However, sometimes TPOT may increase dramatically (10 to 100ms for example) as concurrency grows. This is possibly caused by [**PREEMPTION**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption) in vLLM.
Generally, when your server hits KV cache limits, vLLM tries to free KV cache of requests to ensure sufficient space for other requests, which is called preemption in vLLM. When a request is preempted, the default behavior is to recompute the KV cache of this request again in the future, which is why the performance might drop significantly. There are several ways to verify this:
- vLLM usually logs stats on your server. You might see metrics like `GPU KV cache usage: 99.0%,`. When reaching 100%, it triggers preemption.
- When launching a vLLM server, you will see logs like `GPU KV cache size: 66340 tokens` and `Maximum concurrency for 16,384 tokens per request: 4.05`. These are estimated KV cache capacity for a single DP group. You can adjust the overall request traffic according to this.
Preemption cannot be avoided completely, since KV cache usage always has a limit, but there are ways to reduce its likelihood. As suggested in [**PREEMPTION**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption), the core strategy is to increase the available KV cache: for example, increase `--gpu-memory-utilization` or decrease `--max-num-seqs` and `--max-num-batched-tokens`.
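The concurrency estimate in those startup logs is simple arithmetic, reproduced in this small sketch using the example numbers above:
```python
# Reproduces the startup-log estimate "Maximum concurrency for 16,384 tokens per request: 4.05".
kv_cache_tokens = 66340      # "GPU KV cache size: 66340 tokens"
tokens_per_request = 16384   # one max-model-len worth of tokens per request
print(f"Maximum concurrency: {kv_cache_tokens / tokens_per_request:.2f}")  # ~4.05
```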

View File

@@ -57,7 +57,7 @@ user_guide/release_notes
:caption: Developer Guide
:maxdepth: 1
developer_guide/contribution/index
developer_guide/Design_Documents/index
developer_guide/feature_guide/index
developer_guide/evaluation/index
developer_guide/performance_and_debug/index
:::

View File

@@ -11,7 +11,7 @@ This document describes how to install vllm-ascend manually.
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| Ascend HDK | Refer to the documentation [CANN 8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
| Ascend HDK | Refer to the documentation [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
| CANN | == 8.5.1 | Required for vllm-ascend and torch-npu |
| torch-npu | == 2.9.0 | Required for vllm-ascend, No need to install manually, it will be auto installed in below steps |
| torch | == 2.9.0 | Required for torch-npu and vllm |
@@ -128,7 +128,7 @@ sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror,only versions 0.11.0 and earlier are supported, if using a version later than 0.11.0, do not execute this command
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
@@ -244,7 +244,7 @@ docker run --rm \
-it $IMAGE bash
```
The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developers immediately make changes without requiring a new installation.
The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) (`pip install -e`) to help developer immediately take place changes without requiring a new installation.
## Extra information
@@ -284,7 +284,7 @@ python example.py
If you encounter a connection error with Hugging Face (e.g., `We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`), run the following commands to use ModelScope as an alternative:
```bash
export VLLM_USE_MODELSCOPE=True
export VLLM_USE_MODELSCOPE=true
pip install modelscope
python example.py
```

View File

@@ -36,7 +36,7 @@ Documentation: https://docs.vllm.ai/projects/ascend/en/latest/index.html
## Developer Guide
- [Contribution Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html): setup and contribution workflow.
- [Developer Design Documents](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/index.html): developer-side feature docs.
- [Developer Feature Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/index.html): developer-side feature docs.
- [Evaluation Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/index.html): evaluation workflow.
- [Performance and Debug Index](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/index.html): profiling and debugging docs.

File diff suppressed because it is too large

View File

@@ -4,197 +4,201 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/community/governance.md:1
#: ../../community/governance.md:1
msgid "Governance"
msgstr "治理"
#: ../../source/community/governance.md:3
#: ../../community/governance.md:3
msgid "Mission"
msgstr "使命"
#: ../../source/community/governance.md:5
#: ../../community/governance.md:4
msgid ""
"As a vital component of vLLM, the vLLM Ascend project is dedicated to "
"providing an easy, fast, and cheap LLM Serving for everyone on Ascend "
"NPUs and to actively contributing to the enrichment of vLLM."
"providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, "
"and to actively contribute to the enrichment of vLLM."
msgstr ""
"作为 vLLM 的重要组成部分vLLM Ascend 项目致力于为所有人在昇腾 NPU "
"上提供简单、快速且低成本的大语言模型服务,并积极为丰富 vLLM 生态系统做出贡献。"
"作为 vLLM 的重要组成部分vLLM Ascend 项目致力于为所有人在 Ascend NPU 上提供简单、快速且低成本的大语言模型服务,并积极促进"
" vLLM 的丰富发展。"
#: ../../source/community/governance.md:7
#: ../../community/governance.md:6
msgid "Principles"
msgstr "原则"
#: ../../source/community/governance.md:9
#: ../../community/governance.md:7
msgid ""
"vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF"
" CONDUCT](https://github.com/vllm-"
"project/vllm/blob/main/CODE_OF_CONDUCT.md)"
"vLLM Ascend follows the vLLM community's code of conduct[vLLM - CODE OF "
"CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)"
msgstr ""
"vLLM Ascend 遵循 vLLM 社区的行为准则:[vLLM - 行为准则](https://github.com/vllm-"
"project/vllm/blob/main/CODE_OF_CONDUCT.md)"
#: ../../source/community/governance.md:11
#: ../../community/governance.md:9
msgid "Governance - Mechanics"
msgstr "治理 - 机制"
#: ../../source/community/governance.md:13
#: ../../community/governance.md:10
msgid ""
"vLLM Ascend is an open-source project under the vLLM community, where the"
" authority to appoint roles is ultimately determined by the vLLM "
"community. It adopts a hierarchical technical governance structure."
"vLLM Ascend is an open-source project under the vLLM community, where the "
"authority to appoint roles is ultimately determined by the vLLM community. "
"It adopts a hierarchical technical governance structure."
msgstr "vLLM Ascend 是 vLLM 社区下的一个开源项目,其角色任命权最终由 vLLM 社区决定。它采用分层的技术治理结构。"
#: ../../source/community/governance.md:15
#: ../../community/governance.md:12
msgid "Contributor:"
msgstr "贡献者:"
#: ../../source/community/governance.md:17
#: ../../community/governance.md:14
msgid ""
"**Responsibility:** Help new contributors onboarding, handle and respond "
"to community questions, review RFCs and code."
msgstr "**职责:** 帮助新贡献者加入,处理和回复社区问题,审查 RFC 和代码"
"**Responsibility:** Help new contributors on boarding, handle and respond to"
" community questions, review RFCs, code"
msgstr "**职责:** 帮助新贡献者加入处理和回复社区问题审查RFC和代码"
#: ../../source/community/governance.md:19
#: ../../community/governance.md:16
msgid ""
"**Requirements:** Complete at least 1 contribution. A contributor is "
"someone who consistently and actively participates in a project, "
"including but not limited to issue/review/commits/community involvement."
msgstr "**要求:** 完成至少 1 次贡献。贡献者是指持续且积极参与项目的人,包括但不限于提交问题、进行评审、提交代码和参与社区活动。"
"**Requirements:** Complete at least 1 contribution. Contributor is someone "
"who consistently and actively participates in a project, included but not "
"limited to issue/review/commits/community involvement."
msgstr "**要求:** 完成至少1次贡献。贡献者是指持续且积极参与项目的人,包括但不限于问题、评审、提交和社区参与。"
#: ../../source/community/governance.md:21
#: ../../community/governance.md:18
msgid ""
"The contributor permissions are granted by the [vllm-project/vllm-"
"ascend](https://github.com/vllm-project/vllm-ascend)'s repo `Triage` on "
"GitHub, including repo read and clone, issue and PR management, "
"facilitating efficient collaboration between community developers."
"Contributors will be empowered [vllm-project/vllm-"
"ascend](https://github.com/vllm-project/vllm-ascend) Github repo `Triage` "
"permissions (`Can read and clone this repository. Can also manage issues and"
" pull requests`) to help community developers collaborate more efficiently."
msgstr ""
"贡献者将被予 [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-"
"ascend) GitHub 仓库的 `Triage` 权限(包括仓库读取和克隆问题和拉取请求管理),以促进社区开发者之间的高效协作。"
"贡献者将被予 [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-"
"ascend) Github 仓库的 `Triage` 权限(`可读取和克隆此仓库。还可以管理问题和拉取请求`),以帮助社区开发者更加高效协作。"
#: ../../source/community/governance.md:23
#: ../../community/governance.md:20
msgid "Maintainer:"
msgstr "维护者:"
#: ../../source/community/governance.md:25
#: ../../community/governance.md:22
msgid ""
"**Responsibility:** Develop the project's vision and mission. Maintainers"
" are responsible for shaping the technical direction of the project and "
"ensuring its long-term success. With code merge permissions, they lead "
"roadmap planning, review community contributions, make ongoing code "
"improvements, and actively participate in community engagement—such as "
"regular meetings and events."
"**Responsibility:** Develop the project's vision and mission. Maintainers "
"are responsible for driving the technical direction of the entire project "
"and ensuring its overall success, possessing code merge permissions. They "
"formulate the roadmap, review contributions from community members, "
"continuously contribute code, and actively engage in community activities "
"(such as regular meetings/events)."
msgstr ""
"**责:** "
"制定项目的愿景和使命。维护者负责引领项目的技术方向并确保其长期成功,拥有代码合并权限。他们制定路线图,审核社区贡献,持续改进代码,并积极参与社区活动(如定期会议活动)。"
"**责** "
"制定项目的愿景和使命。维护者负责引领整个项目的技术方向并确保其整体成功,拥有代码合并权限。他们制定路线图,审核社区成员的贡献,持续贡献代码,并积极参与社区活动(如定期会议/活动)。"
#: ../../source/community/governance.md:27
#: ../../community/governance.md:24
msgid ""
"**Requirements:** Deep understanding of vLLM and vLLM Ascend code "
"bases, with a commitment to sustained code contributions and competency "
"in design, development, and PR review workflows."
msgstr "**要求:** 深入理解 vLLMvLLM Ascend 代码库,承诺持续贡献代码,并具备 ‌设计、开发和 PR 审核工作流‌ 的能力。"
"**Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, "
"with a commitment to sustained code contributions. Competency in "
"design/development/PR review workflows."
msgstr ""
"**要求:** 深入理解 vLLMvLLM Ascend 代码库,并承诺持续贡献代码。具备 ‌设计/开发/PR 审核流程‌ 的能力。"
#: ../../source/community/governance.md:29
#: ../../community/governance.md:25
msgid ""
"**Review quality:** Actively participate in community code reviews, "
"**Review Quality:** Actively participate in community code reviews, "
"ensuring high-quality code integration."
msgstr "**评审质量:** 积极参与社区代码评审,确保高质量的代码集成。"
#: ../../source/community/governance.md:30
#: ../../community/governance.md:26
msgid ""
"**Quality contribution:** Successfully develop and deliver at least one "
"**Quality Contribution:** Successfully develop and deliver at least one "
"major feature while maintaining consistent high-quality contributions."
msgstr "**质量贡献:** 成功开发并交付至少一个主要功能,同时保持持续的高质量贡献。"
msgstr "**质量贡献** 成功开发并交付至少一个主要功能,同时持续保持高质量贡献。"
#: ../../source/community/governance.md:31
#: ../../community/governance.md:27
msgid ""
"**Community involvement:** Actively address issues, respond to forum "
"inquiries, participate in discussions, and engage in community-driven "
"tasks."
msgstr "**社区参与:** 积极解决问题,回复论坛询问,参与讨论,并投身于社区驱动的任务。"
"**Community Involvement:** Actively address issues, respond to forum "
"inquiries, participate in discussions, and engage in community-driven tasks."
msgstr "**社区参与:** 积极解决问题,回复论坛询问,参与讨论,并参与社区驱动的任务。"
#: ../../source/community/governance.md:33
#: ../../community/governance.md:29
msgid ""
"The approval from existing Maintainers is required. The vLLM community "
"has the final decision-making authority. Maintainers will be granted "
"write access to the [vllm-project/vllm-ascend](https://github.com/vllm-"
"project/vllm-ascend) GitHub repo. This includes permission to read, "
"clone, and push to the repository, as well as manage issues and pull "
"requests."
"Requires approval from existing Maintainers. The vLLM community has the "
"final decision-making authority."
msgstr "需要现有维护者的批准。vLLM社区拥有最终决策权。"
#: ../../community/governance.md:31
msgid ""
"Maintainer will be empowered [vllm-project/vllm-"
"ascend](https://github.com/vllm-project/vllm-ascend) Github repo write "
"permissions (`Can read, clone, and push to this repository. Can also manage "
"issues and pull requests`)."
msgstr ""
"需要获得现有维护者的批准。vLLM 社区拥有最终决策权。维护者将被授予 [vllm-project/vllm-"
"ascend](https://github.com/vllm-project/vllm-ascend) GitHub 仓库的写入权限。这包括读取、克隆和推送仓库的权限,以及管理问题和拉取请求的权限。"
"维护者将被授予 [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-"
"ascend) Github 仓库的写入权限`可以读取、克隆和推送到此仓库。还可以管理问题和拉取请求`。"
#: ../../source/community/governance.md:36
#: ../../community/governance.md:33
msgid "Nominating and Removing Maintainers"
msgstr "提名和移除维护者"
#: ../../source/community/governance.md:38
#: ../../community/governance.md:35
msgid "The Principles"
msgstr "原则"
#: ../../source/community/governance.md:40
#: ../../community/governance.md:37
msgid ""
"Membership in vLLM Ascend is given to individuals on a merit basis after "
"they demonstrate their strong expertise in vLLM/vLLM Ascend through "
"contributions, reviews, and discussions."
"Membership in vLLM Ascend is given to individuals on merit basis after they "
"demonstrated strong expertise of the vLLM / vLLM Ascend through "
"contributions, reviews and discussions."
msgstr ""
"vLLM Ascend 的成员资格是基于个人能力授予的,只有在通过贡献、评审和讨论展示出对 vLLM / vLLM Ascend "
"的深厚专业知识后,才可获得。"
#: ../../source/community/governance.md:42
#: ../../community/governance.md:39
msgid ""
"For membership in the maintainer group, individuals have to demonstrate "
"strong and continued alignment with the overall vLLM/vLLM Ascend "
"For membership in the maintainer group the individual has to demonstrate "
"strong and continued alignment with the overall vLLM / vLLM Ascend "
"principles."
msgstr "要成为维护者组成员,个人必须表现出与 vLLM / vLLM Ascend 总体原则的高度一致并持续支持。"
#: ../../source/community/governance.md:44
#: ../../community/governance.md:41
msgid ""
"Maintainers who have been inactive for a long time may be transitioned to"
" **emeritus** status under lenient criteria."
msgstr "长期不活跃的维护者,可根据宽松的标准转为 **荣誉** 状态。"
"Light criteria of moving module maintenance to emeritus status if they "
"don't actively participate over long periods of time."
msgstr "如果模块维护人员在长时间内没有积极参与,可根据宽松的标准将其维护状态转为“荣誉”状态。"
#: ../../source/community/governance.md:46
#: ../../community/governance.md:43
msgid "The membership is for an individual, not a company."
msgstr "该成员资格属于个人,而非公司。"
msgstr "该成员资格属于个人,而非公司。"
#: ../../source/community/governance.md:48
#: ../../community/governance.md:45
msgid "Nomination and Removal"
msgstr "提名与罢免"
#: ../../source/community/governance.md:50
#: ../../community/governance.md:47
msgid ""
"Nomination: Anyone can nominate a candidate to become a maintainer, "
"including self-nominations. All existing maintainers are responsible for "
"reviewing and evaluating each nomination. The nominator should provide "
"relevant information about the nominee's qualifications—such as review "
"quality, quality contribution, and community involvement—among other "
"strengths."
msgstr "提名:任何人都可以提名候选人成为维护者(包括自荐)。所有现有维护者都有责任审查和评估每项提名。提名人应提供被提名人的相关资格信息,例如评审质量、质量贡献和社区参与度等优势。"
"Nomination: Anyone can nominate someone to become a maintainer (include "
"self-nominate). All existing maintainers are responsible for evaluating the "
"nomination. The nominator should provide nominee's info around the strength "
"of the candidate to be a maintainer, include but not limited to review "
"quality, quality contribution, community involvement."
msgstr ""
"提名:任何人都可以提名某人成为维护者(包括自荐)。所有现有维护者都有责任评估提名。提名人应提供被提名人成为维护者的相关优势信息,包括但不限于评审质量、质量贡献、社区参与。"
#: ../../source/community/governance.md:51
#: ../../community/governance.md:48
msgid ""
"Removal: Anyone may nominate an individual for removal from the "
"maintainer role, including self-nominations. All current maintainers are "
"responsible for reviewing and evaluating such nominations. The nominator "
"should provide relevant information about the nominee—such as prolonged "
"inactivity, misalignment with the project's overall direction, or other "
"factors that may render them unsuitable for the maintainer position."
msgstr "移除:任何人都可以提名某人从维护者角色中移除(包括自荐)。所有现有维护者都有责任审查和评估此类提名。提名人应提供被提名人的相关信息,例如长期不活跃、与项目整体方向不一致,或其他可能使其不适合担任维护者职位的因素。"
"Removal: Anyone can nominate a person to be removed from maintainer position"
" (include self-nominate). All existing maintainers are responsible for "
"evaluating the nomination. The nominator should provide nominee's info, "
"include but not limited to lack of activity, conflict with the overall "
"direction and other information that makes them unfit to be a maintainer."
msgstr ""
"移除:任何人都可以提名某人被移出维护者职位(包括自荐)。所有现有维护者都有责任评估该提名。提名人应提供被提名人的相关信息,包括但不限于缺乏活动、与整体方向冲突以及使其不适合作为维护者的其他信息。"

View File

@@ -4,98 +4,100 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/community/user_stories/index.md:15
#: ../../community/user_stories/index.md:15
msgid "More details"
msgstr "更多详情"
msgstr "更多细节"
#: ../../source/community/user_stories/index.md:1
#: ../../community/user_stories/index.md:1
msgid "User Stories"
msgstr "用户案例"
msgstr "用户故事"
#: ../../source/community/user_stories/index.md:3
#: ../../community/user_stories/index.md:3
msgid ""
"Read case studies on how users and developers solve real, everyday "
"problems with vLLM Ascend"
msgstr "阅读案例研究,了解用户和开发者如何用 vLLM Ascend 解决实际日常问题。"
"Read case studies on how users and developers solves real, everyday problems"
" with vLLM Ascend"
msgstr "阅读案例研究,了解用户和开发者如何使用 vLLM Ascend 解决实际日常问题。"
#: ../../source/community/user_stories/index.md:5
#: ../../community/user_stories/index.md:5
msgid ""
"[LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient "
"platform for training and fine-tuning large language models. It supports "
"vLLM Ascend to speed up inference since [LLaMA-"
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), "
"gaining 2x performance enhancement in inference."
"[LLaMA-Factory](./llamafactory.md) is an easy-to-use and efficient platform "
"for training and fine-tuning large language models, it supports vLLM Ascend "
"to speed up inference since [LLaMA-"
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739), gain 2x "
"performance enhancement of inference."
msgstr ""
"[LLaMA-Factory](./llamafactory.md) 是一个易于使用且高效的大语言模型训练与微调平台。自 "
"[LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739) 起支持 "
"vLLM Ascend 加速推理,推理性能提升 2 倍。"
"[LLaMA-Factory](./llamafactory.md) 是一个易于使用且高效的大语言模型训练与微调平台,自 [LLaMA-"
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739) 起支持 vLLM "
"Ascend 加速推理,推理性能提升 2 倍。"
#: ../../source/community/user_stories/index.md:7
#: ../../community/user_stories/index.md:7
msgid ""
"[Huggingface/trl](https://github.com/huggingface/trl) is a cutting-edge "
"library designed for post-training foundation models using advanced "
"techniques like SFT, PPO and DPO. It uses vLLM Ascend since "
"techniques like SFT, PPO and DPO, it uses vLLM Ascend since "
"[v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) to "
"support RLHF on Ascend NPUs."
"support RLHF on Ascend NPU."
msgstr ""
"[Huggingface/trl](https://github.com/huggingface/trl) 是一个前沿的库,专为使用 SFT、PPO 和 DPO "
"等先进技术对基础模型进行后训练而设计。自 "
"[v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) 起,该库使用 "
"vLLM Ascend 支持在昇腾 NPU 上进行 RLHF。"
"[Huggingface/trl](https://github.com/huggingface/trl) 是一个前沿的库,专为使用 SFT、PPO 和"
" DPO 等先进技术对基础模型进行后训练而设计。从 "
"[v0.17.0](https://github.com/huggingface/trl/releases/tag/v0.17.0) 版本开始,该库利用"
" vLLM Ascend 支持在 Ascend NPU 上进行 RLHF。"
#: ../../source/community/user_stories/index.md:9
#: ../../community/user_stories/index.md:9
msgid ""
"[MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference"
" engine acceleration plugin library developed by Huawei on Ascend "
"hardware, which includes self-developed LLM optimization algorithms and "
"optimizations related to the inference engine framework. It supports vLLM"
" Ascend since "
"[2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev"
"/mindie-turbo-0001.html)."
"[MindIE Turbo](https://pypi.org/project/mindie-turbo) is an LLM inference "
"engine acceleration plug-in library developed by Huawei on Ascend hardware, "
"which includes self-developed large language model optimization algorithms "
"and optimizations related to the inference engine framework. It supports "
"vLLM Ascend since "
"[2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-"
"turbo-0001.html)."
msgstr ""
"[MindIE Turbo](https://pypi.org/project/mindie-turbo) "
"是华为在昇腾硬件上开发的一款用于加速大语言模型推理引擎的插件库,包含自主研发的大语言模型优化算法及与推理引擎框架相关的优化。自 "
"[2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev"
"/mindie-turbo-0001.html) 起,支持 vLLM Ascend。"
"是华为在昇腾硬件上开发的一款用于加速 LLM 推理引擎的插件库,包含自主研发的大语言模型优化算法及与推理引擎框架相关的优化。自 "
"[2.0rc1](https://www.hiascend.com/document/detail/zh/mindie/20RC1/AcceleratePlugin/turbodev/mindie-"
"turbo-0001.html) 起,支持 vLLM Ascend。"
#: ../../source/community/user_stories/index.md:11
#: ../../community/user_stories/index.md:11
msgid ""
"[GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU "
"cluster manager for running AI models. It supports vLLM Ascend since "
"[v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2). See "
"more GPUStack performance evaluation information at [this "
"link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew)."
"[v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2), see more"
" GPUStack performance evaluation info on "
"[link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew)."
msgstr ""
"[GPUStack](https://github.com/gpustack/gpustack) 是一个开源的 GPU 集群管理器,用于运行 AI 模型。自 "
"[v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2) 起支持 vLLM "
"Ascend,更多 GPUStack 性能评测信息请参见 "
"[链接](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew)。"
"[GPUStack](https://github.com/gpustack/gpustack) 是一个开源的 GPU 集群管理器,用于运行 AI "
"模型。从 [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2) "
"版本开始支持 vLLM Ascend,更多 GPUStack 性能评测信息见 "
"[链接](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew)。"
#: ../../source/community/user_stories/index.md:13
#: ../../community/user_stories/index.md:13
msgid ""
"[verl](https://github.com/volcengine/verl) is a flexible, efficient, and "
"production-ready RL training library for LLMs. It uses vLLM Ascend since "
"[v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). See "
"more information on [verl x Ascend "
"Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/quick_start/ascend_quick_start.html)."
"[verl](https://github.com/volcengine/verl) is a flexible, efficient and "
"production-ready RL training library for large language models (LLMs), uses "
"vLLM Ascend since "
"[v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0), see more "
"info on [verl x Ascend "
"Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html)."
msgstr ""
"[verl](https://github.com/volcengine/verl) "
"是一个灵活、高效且可用于生产环境的大语言模型强化学习训练库。自 "
"[v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0) 起,该库使用 "
"vLLM Ascend,更多信息请参见 [verl x Ascend "
"快速入门](https://verl.readthedocs.io/en/latest/ascend_tutorial/quick_start/ascend_quick_start.html)。"
"是一个灵活、高效且可用于生产环境的大语言模型(LLM)强化学习训练库,自 "
"[v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0) 起支持 vLLM "
"Ascend,更多信息请参见 [verl x Ascend "
"快速上手](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html)。"

View File

@@ -4,76 +4,84 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/community/user_stories/llamafactory.md:1
#: ../../community/user_stories/llamafactory.md:1
msgid "LLaMA-Factory"
msgstr "LLaMA-Factory"
#: ../../source/community/user_stories/llamafactory.md:3
msgid "**Introduction**"
msgstr "**简介**"
#: ../../community/user_stories/llamafactory.md:3
msgid "**About / Introduction**"
msgstr "**关于 / 介绍**"
#: ../../source/community/user_stories/llamafactory.md:5
#: ../../community/user_stories/llamafactory.md:5
msgid ""
"[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-"
"use and efficient platform for training and fine-tuning large language "
"models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained "
"models locally without writing any code."
"[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use "
"and efficient platform for training and fine-tuning large language models. "
"With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally"
" without writing any code."
msgstr ""
"[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) 是一个易于使用且高效的平台,用于训练和微调大型语言模型。通过 LLaMA-Factory,您可以在本地对数百个预训练模型进行微调,无需编写任何代码。"
"[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) "
"是一个易于使用且高效的平台,用于训练和微调大型语言模型。有了 LLaMA-Factory,你可以在本地对数百个预训练模型进行微调,无需编写任何代码。"
#: ../../source/community/user_stories/llamafactory.md:7
#: ../../community/user_stories/llamafactory.md:7
msgid ""
"LLaMA-Factory users need to evaluate the model and perform inference "
"after fine-tuning."
msgstr "LLaMA-Factory 用户在完成微调后,需要对模型进行评估和推理。"
"LLaMA-Facotory users need to evaluate and inference the model after fine-"
"tuning the model."
msgstr "LLaMA-Facotory 用户需要在对模型进行微调后对模型进行评估和推理。"
#: ../../source/community/user_stories/llamafactory.md:9
msgid "**Business challenge**"
#: ../../community/user_stories/llamafactory.md:9
msgid "**The Business Challenge**"
msgstr "**业务挑战**"
#: ../../source/community/user_stories/llamafactory.md:11
#: ../../community/user_stories/llamafactory.md:11
msgid ""
"LLaMA-Factory uses Transformers to perform inference on Ascend NPUs, but "
"the speed is slow."
msgstr "LLaMA-Factory 使用 Transformers 在昇腾 NPU 上进行推理,但速度较慢。"
"LLaMA-Factory used transformers to perform inference on Ascend NPU, but the "
"speed was slow."
msgstr "LLaMA-Factory 使用 transformers 在 Ascend NPU 上进行推理,但速度较慢。"
#: ../../source/community/user_stories/llamafactory.md:13
msgid "**Benefits with vLLM Ascend**"
msgstr "**vLLM Ascend 带来的优势**"
#: ../../community/user_stories/llamafactory.md:13
msgid "**Solving Challenges and Benefits with vLLM Ascend**"
msgstr "**通过 vLLM Ascend 解决挑战与收益**"
#: ../../source/community/user_stories/llamafactory.md:15
#: ../../community/user_stories/llamafactory.md:15
msgid ""
"With the joint efforts of LLaMA-Factory and vLLM Ascend ([LLaMA-"
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), "
"LLaMA-Factory has achieved significant performance gains during model "
"inference. Benchmark results show that its inference speed is now up to "
"2× faster compared to the Transformers implementation."
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)), the "
"performance of LLaMA-Factory in the model inference stage has been "
"significantly improved. According to the test results, the inference speed "
"of LLaMA-Factory has been increased to 2x compared to the transformers "
"version."
msgstr ""
"通过 LLaMA-Factory 与 vLLM Ascend 的共同努力([LLaMA-Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)),LLaMA-Factory 在模型推理阶段实现了显著的性能提升。基准测试结果表明,其推理速度相比 Transformers 实现最高提升了 2 倍。"
"在 LLaMA-Factory 与 vLLM Ascend 的共同努力下(参见 [LLaMA-"
"Factory#7739](https://github.com/hiyouga/LLaMA-Factory/pull/7739)),LLaMA-"
"Factory 在模型推理阶段的性能得到了显著提升。根据测试结果,LLaMA-Factory 的推理速度相比 transformers 版本提升到了 2"
" 倍。"
#: ../../source/community/user_stories/llamafactory.md:17
#: ../../community/user_stories/llamafactory.md:17
msgid "**Learn more**"
msgstr "**了解更多**"
#: ../../source/community/user_stories/llamafactory.md:19
#: ../../community/user_stories/llamafactory.md:19
msgid ""
"See more details about LLaMA-Factory and how it uses vLLM Ascend for "
"inference on Ascend NPUs in [LLaMA-Factory Ascend NPU "
"See more about LLaMA-Factory and how it uses vLLM Ascend for inference on "
"the Ascend NPU in the following documentation: [LLaMA-Factory Ascend NPU "
"Inference](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html)."
msgstr ""
"有关 LLaMA-Factory 的更多详情以及如何在昇腾 NPU 上使用 vLLM Ascend 进行推理,请参阅 [LLaMA-Factory 昇腾 NPU 推理](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html)。"
"在以下文档中查看更多关于 LLaMA-Factory 以及如何在 Ascend NPU 上使用 vLLM Ascend 进行推理的信息:[LLaMA-"
"Factory Ascend NPU "
"推理](https://llamafactory.readthedocs.io/en/latest/advanced/npu_inference.html)。"

View File

@@ -1,283 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:1
msgid "ACL Graph"
msgstr "ACL 图"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:3
msgid "Why do we need ACL Graph?"
msgstr "为什么需要 ACL 图?"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:5
msgid ""
"In LLM inference, each token requires nearly a thousand operator "
"executions. When host launching operators are slower than device, it will"
" cause host bound. In severe cases, the device will be idle for more than"
" half of the time. To solve this problem, we use graph in LLM inference."
msgstr ""
"在 LLM 推理中,每个 token 需要执行近千次算子。当主机(host)启动算子的速度慢于设备(device)时,会导致主机瓶颈(host bound)。在严重情况下,设备超过一半的时间将处于空闲状态。为了解决这个问题,我们在 LLM 推理中使用图(graph)。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:26
msgid "How to use ACL Graph?"
msgstr "如何使用 ACL 图?"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:28
msgid ""
"ACL Graph is enabled by default in the V1 Engine; you just need to check "
"that `enforce_eager` is not set to `True`. See more details at: [Graph Mode "
"Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html)"
msgstr ""
"ACL 图在 V1 引擎中默认启用,您只需确认 `enforce_eager` 未设置为 `True`。更多详情请参阅:[图模式指南](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html)"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:30
msgid "How it works?"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:32
msgid ""
"In short, graph mode works in two steps: **capture and replay**. When the"
" engine starts, we capture all of the ops in the model forward and save "
"it as a graph. When a request comes in, we just replay the graph on the "
"device and wait for the result."
msgstr ""
"简而言之,图模式分两步工作:**捕获(capture)和重放(replay)**。当引擎启动时,我们捕获模型前向传播中的所有算子并将其保存为一个图。当请求到达时,我们只需在设备上重放该图并等待结果。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:34
msgid "But in reality, graph mode is not that simple."
msgstr "但实际上,图模式并非如此简单。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:36
msgid "Padding and Bucketing"
msgstr "填充与分桶"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:38
msgid ""
"Due to the fact that a graph can only replay the ops captured before, "
"without doing tiling and checking graph input, we need to ensure the "
"consistency of the graph input. However, we know that the model input's "
"shape depends on the request scheduled by the Scheduler, so we can't "
"ensure consistency."
msgstr ""
"由于图只能重放之前捕获的算子,而不会进行分片(tiling)或检查图输入,因此我们需要确保图输入的一致性。然而,我们知道模型输入的形状取决于调度器(Scheduler)安排的请求,因此无法保证一致性。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:40
msgid ""
"Obviously, we can solve this problem by capturing the biggest shape and "
"padding all of the model inputs to it. But this will bring a lot of "
"redundant computing and make performance worse. So we can capture "
"multiple graphs with different shapes, and pad the model input to the "
"nearest graph, which will greatly reduce redundant computing. But when "
"`max_num_batched_tokens` is very large, the number of graphs that need to"
" be captured will also become very large. We know that when the input "
"tensor's shape is large, the computing time will be very long, and graph "
"mode is not necessary in this case. So all of the things we need to do "
"are:"
msgstr ""
"显然,我们可以通过捕获最大形状并将所有模型输入填充到该形状来解决此问题。但这会带来大量冗余计算并使性能变差。因此,我们可以捕获多个不同形状的图,并将模型输入填充到最接近的图,这将大大减少冗余计算。但当 `max_num_batched_tokens` 非常大时,需要捕获的图数量也会变得非常大。我们知道,当输入张量的形状很大时,计算时间会很长,在这种情况下图模式并非必要。因此,我们需要做的所有事情是:"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:42
msgid "Set a threshold;"
msgstr "设置一个阈值;"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:43
msgid ""
"When `num_scheduled_tokens` is bigger than the threshold, use "
"`eager_mode`;"
msgstr "当 `num_scheduled_tokens` 大于阈值时,使用 `eager_mode`"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:44
msgid "Capture multiple graphs within a range below the threshold;"
msgstr "在低于阈值的范围内捕获多个图;"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:59
msgid "Piecewise and Full graph"
msgstr "分段图与完整图"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:61
msgid ""
"Due to the increasing complexity of the attention layer in current LLMs, "
"we can't ensure all types of attention can run in graph. In MLA, "
"prefill_tokens and decode_tokens have different calculation methods, so "
"when a batch has both prefills and decodes in MLA, graph mode is "
"difficult to handle this situation."
msgstr ""
"由于当前 LLM 中注意力层的复杂性不断增加,我们无法确保所有类型的注意力都能在图模式下运行。在 MLA 中,prefill_tokens 和 decode_tokens 有不同的计算方法,因此当 MLA 中的一个批次同时包含预填充和解码时,图模式难以处理这种情况。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:63
msgid ""
"vLLM solves this problem with piecewise graph mode. We use eager mode to "
"launch attention's ops, and use graph to deal with others. But this also "
"brings some problems: The cost of launching ops has become large again. "
"Although much smaller than eager mode, it will also lead to host bound "
"when the CPU is poor or `num_tokens` is small."
msgstr ""
"vLLM 通过分段图模式解决了这个问题。我们使用 eager 模式来启动注意力算子,并使用图来处理其他算子。但这也会带来一些问题:启动算子的开销再次变大。虽然比 eager 模式小得多,但当 CPU 性能较差或 `num_tokens` 较小时,仍会导致主机瓶颈。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:65
msgid "Altogether, we need to support both piecewise and full graph mode."
msgstr "总之,我们需要同时支持分段图和完整图模式。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:67
msgid ""
"When attention can run in graph, we tend to choose full graph mode to "
"achieve optimal performance;"
msgstr "当注意力可以在图中运行时,我们倾向于选择完整图模式以获得最佳性能;"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:68
msgid "When full graph does not work, use piecewise graph as a substitute;"
msgstr "当完整图无法工作时,使用分段图作为替代;"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:69
msgid ""
"When piecewise graph's performance is not good and full graph mode is "
"blocked, separate prefills and decodes, and use full graph mode in "
"**decode_only** situations. Because when a batch includes prefill "
"requests, usually `num_tokens` will be quite big and not cause host "
"bound."
msgstr ""
"当分段图性能不佳且完整图模式受阻时,将预填充和解码分离,并在 **decode_only** 情况下使用完整图模式。因为当一个批次包含预填充请求时,通常 `num_tokens` 会相当大,不会导致主机瓶颈。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:71
msgid ""
"Currently, due to stream resource constraint, we can only support a few "
"buckets in piecewise graph mode now, which will cause redundant computing"
" and may lead to performance degradation compared with eager mode."
msgstr ""
"目前,由于流资源限制,我们现在只能在分段图模式下支持少数几个桶(buckets),这会导致冗余计算,并且与 eager 模式相比可能导致性能下降。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:73
msgid "How is it implemented?"
msgstr "如何实现?"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:75
msgid ""
"vLLM has already implemented most of the modules in graph mode. You can "
"see more details at: [CUDA "
"Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)"
msgstr ""
"vLLM 已经在图模式下实现了大部分模块。您可以在以下链接查看更多详情:[CUDA 图](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:77
msgid ""
"When in graph mode, vLLM will call "
"`current_platform.get_static_graph_wrapper_cls` to get the current "
"device's graph model wrapper, so what we need to do is implement the "
"graph mode wrapper on Ascend: `ACLGraphWrapper`."
msgstr ""
"在图模式下,vLLM 会调用 `current_platform.get_static_graph_wrapper_cls` 来获取当前设备的图模型包装器,因此我们需要做的是在 Ascend 上实现图模式包装器:`ACLGraphWrapper`。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:79
msgid ""
"vLLM has added `support_torch_compile` decorator to all models. This "
"decorator will replace the `__init__` and `forward` interface of the "
"model class. When `forward` is called, the code inside the "
"`ACLGraphWrapper` will be executed, and it will do capture or replay as "
"mentioned above."
msgstr ""
"vLLM 已为所有模型添加了 `support_torch_compile` 装饰器。此装饰器将替换模型类的 `__init__` 和 `forward` 接口。当调用 `forward` 时,`ACLGraphWrapper` 内部的代码将被执行,并执行如上所述的捕获或重放操作。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:81
msgid ""
"When using piecewise graph, we just need to follow the above-mentioned "
"process. But when in full graph, due to the complexity of the attention, "
"sometimes we need to update attention op's params before execution. So we"
" implement `update_attn_params` and `update_mla_attn_params` functions "
"for full graph mode. During forward, memory will be reused between "
"different ops, so we can't update attention op's params before forward. "
"In ACL Graph, we use `torch.npu.graph_task_update_begin` and "
"`torch.npu.graph_task_update_end` to do it, and use "
"`torch.npu.ExternalEvent` to ensure order between param updates and op "
"executions."
msgstr ""
"使用分段图时,我们只需遵循上述流程。但在完整图模式下,由于注意力的复杂性,有时我们需要在执行前更新注意力算子的参数。因此,我们为完整图模式实现了 `update_attn_params` 和 `update_mla_attn_params` 函数。在前向传播期间,内存会在不同算子之间重用,因此我们无法在前向传播之前更新注意力算子的参数。在 ACL 图中,我们使用 `torch.npu.graph_task_update_begin` 和 `torch.npu.graph_task_update_end` 来实现这一点,并使用 `torch.npu.ExternalEvent` 来确保参数更新与算子执行之间的顺序。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:83
msgid "DFX"
msgstr "DFX"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:85
msgid "Stream resource constraint"
msgstr "流资源限制"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:87
msgid ""
"Currently, we can only capture 1800 graphs at most, due to the limitation"
" of ACL graph that a graph requires at least a separate stream. This "
"number is bounded by the number of streams, which is 2048; we save 248 "
"streams as a buffer. Besides, there are many variables that can affect "
"the number of buckets:"
msgstr ""
"目前,由于 ACL 图的限制(一个图至少需要一个独立的流),我们最多只能捕获 1800 个图。这个数字受限于流的数量,即 2048;我们保留 248 个流作为缓冲区。此外,还有许多变量会影响桶的数量:"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:89
msgid ""
"Piecewise graph divides the model into `num_hidden_layers + 1` sub "
"modules, based on the attention layer. Every sub module is a single graph"
" which needs to cost a stream, so the number of buckets in piecewise "
"graph mode is very tight compared with full graph mode."
msgstr ""
"分段图根据注意力层将模型划分为 `num_hidden_layers + 1` 个子模块。每个子模块都是一个单独的图,需要消耗一个流,因此与完整图模式相比,分段图模式下的桶数量非常紧张。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:91
msgid ""
"The number of streams required for a graph is related to the number of "
"comm domains. Each comm domain will increase one stream consumed by a "
"graph."
msgstr "一个图所需的流数量与通信域(comm domain)的数量有关。每个通信域都会增加一个图消耗的流。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:93
msgid ""
"When multi-stream is explicitly called in a sub module, it will consume "
"an additional stream."
msgstr "当在子模块中显式调用多流(multi-stream)时,它将消耗一个额外的流。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:95
msgid ""
"There are some other rules about ACL Graph and stream. Currently, we use "
"func `update_aclgraph_sizes` to calculate the maximum number of buckets "
"and update `graph_batch_sizes` to ensure stream resource is sufficient."
msgstr ""
"关于 ACL 图和流还有一些其他规则。目前,我们使用函数 `update_aclgraph_sizes` 来计算最大桶数并更新 `graph_batch_sizes`,以确保流资源充足。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:97
msgid "We will expand the stream resource limitation in the future."
msgstr "我们将在未来扩展流资源限制。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:99
msgid "Limitations"
msgstr "限制"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:101
msgid "`FULL` and `FULL_AND_PIECEWISE` are not supported now;"
msgstr "目前不支持 `FULL` 和 `FULL_AND_PIECEWISE`"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:102
msgid ""
"When using ACL Graph and MTP with `num_speculative_tokens > 1`, as vLLM "
"doesn't support this case in v0.11.0, we need to set "
"`cudagraph_capture_sizes` explicitly."
msgstr ""
"当使用 ACL 图和 MTP 且 `num_speculative_tokens > 1` 时,由于 vLLM 在 v0.11.0 中不支持此情况,我们需要显式设置 `cudagraph_capture_sizes`。"
#: ../../source/developer_guide/Design_Documents/ACL_Graph.md:103
msgid "`use_inductor` is not supported now;"
msgstr "目前不支持 `use_inductor`"

View File

@@ -1,350 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:1
msgid "KV Cache Pool"
msgstr "KV 缓存池"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:3
msgid "Why KV Cache Pool?"
msgstr "为什么需要 KV 缓存池?"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:5
msgid ""
"Prefix caching is an important feature in LLM inference that can reduce "
"prefill computation time drastically."
msgstr "前缀缓存是大语言模型推理中的一项重要特性,可以显著减少预填充计算时间。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:7
msgid ""
"However, the performance gain from prefix caching is highly dependent on "
"the cache hit rate, while the cache hit rate can be limited if one only "
"uses on-chip memory for KV cache storage."
msgstr "然而,前缀缓存带来的性能提升高度依赖于缓存命中率,而如果仅使用片上内存存储 KV 缓存,缓存命中率会受到限制。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:9
msgid ""
"Hence, KV Cache Pool is proposed to utilize various types of storage "
"including on-chip memory, DRAM, and SSD, making a pool for KV Cache "
"storage while making the prefix of requests visible across all nodes, "
"increasing the cache hit rate for all requests."
msgstr ""
"因此,我们提出了 KV 缓存池,旨在利用包括片上内存、DRAM 和 SSD 在内的多种存储类型,构建一个 KV "
"缓存存储池,同时使请求的前缀在所有节点间可见,从而提高所有请求的缓存命中率。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:11
msgid ""
"vLLM Ascend currently supports [MooncakeStore](https://github.com"
"/kvcache-ai/Mooncake), one of the most recognized KV Cache storage "
"engines."
msgstr ""
"vLLM Ascend 目前支持 [MooncakeStore](https://github.com/kvcache-"
"ai/Mooncake),这是最受认可的 KV 缓存存储引擎之一。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:13
msgid ""
"While one can utilize MooncakeStore in vLLM V1 engine by setting it as a "
"remote backend of LMCache with GPU (see "
"[Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),"
" we find it would be better to integrate a connector that directly "
"supports MooncakeStore and can utilize the data transfer strategy that "
"best fits Huawei NPU hardware."
msgstr ""
"虽然可以通过将 MooncakeStore 设置为 GPU 上 LMCache 的远程后端来在 vLLM V1 "
"引擎中使用它(参见[教程](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),但我们认为集成一个直接支持"
" MooncakeStore 并能利用最适合华为 NPU 硬件的数据传输策略的连接器会更好。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:15
msgid ""
"Hence, we propose to integrate MooncakeStore with a brand new "
"**MooncakeStoreConnectorV1**, which is indeed largely inspired by "
"**LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 "
"Implemented?` section)."
msgstr ""
"因此,我们提议将 MooncakeStore 与全新的 **MooncakeStoreConnectorV1** "
"集成,该连接器的设计在很大程度上受到了 **LMCacheConnectorV1** 的启发(参见 "
"`MooncakeStoreConnectorV1 是如何实现的?` 部分)。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:17
msgid "Usage"
msgstr "使用方法"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:19
msgid ""
"vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable"
" MooncakeStore, one needs to configure `kv-transfer-config` and choose "
"`MooncakeStoreConnector` as the KV Connector."
msgstr ""
"vLLM Ascend 目前支持使用 MooncakeStore 作为 KV 缓存池。要启用 MooncakeStore,需要配置 `kv-"
"transfer-config` 并选择 `MooncakeStoreConnector` 作为 KV 连接器。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:21
msgid ""
"For step-by-step deployment and configuration, please refer to the [KV "
"Pool User "
"Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)."
msgstr ""
"关于逐步部署和配置,请参考 [KV "
"池用户指南](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:23
msgid "How it works?"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:25
msgid ""
"The KV Cache Pool integrates multiple memory tiers (on-chip memory, DRAM,"
" SSD, etc.) through a connector-based architecture."
msgstr "KV 缓存池通过基于连接器的架构整合了多个内存层级片上内存、DRAM、SSD 等)。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:27
msgid ""
"Each connector implements a unified interface for storing, retrieving, "
"and transferring KV blocks between tiers, depending on access frequency "
"and hardware bandwidth."
msgstr "每个连接器实现了一个统一的接口,用于根据访问频率和硬件带宽在不同层级之间存储、检索和传输 KV 块。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:29
msgid ""
"When combined with vLLM's Prefix Caching mechanism, the pool enables "
"efficient caching both locally (in on-chip memory) and globally (via "
"Mooncake), ensuring that frequently used prefixes remain hot while less "
"frequently accessed KV data can spill over to lower-cost memory."
msgstr ""
"当与 vLLM 的前缀缓存机制结合时,该池能够实现本地(片上内存中)和全局(通过 "
"Mooncake的高效缓存确保常用前缀保持热状态而访问频率较低的 KV 数据则可以溢出到成本更低的内存中。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:31
msgid "1. Combining KV Cache Pool with on-chip memory Prefix Caching"
msgstr "1. 将 KV 缓存池与片上内存前缀缓存结合"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:33
msgid ""
"Prefix Caching with on-chip memory is already supported by the vLLM V1 "
"Engine. By introducing KV Connector V1, users can seamlessly combine on-"
"chip memory-based Prefix Caching with Mooncake-backed KV Pool."
msgstr ""
"vLLM V1 引擎已支持基于片上内存的前缀缓存。通过引入 KV Connector V1,用户可以无缝地将基于片上内存的前缀缓存与 "
"Mooncake 支持的 KV 池结合起来。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:36
msgid ""
"The user can enable both features simply by enabling Prefix Caching, "
"which is enabled by default in vLLM V1 unless the "
"`--no_enable_prefix_caching` flag is set, and setting up the KV Connector"
" for KV Pool (e.g., the MooncakeStoreConnector)."
msgstr ""
"用户只需启用前缀缓存(在 vLLM V1 中默认启用,除非设置了 `--no_enable_prefix_caching` 标志)并为 KV "
"池设置 KV 连接器(例如 MooncakeStoreConnector即可同时启用这两个功能。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:38
msgid "**Workflow**:"
msgstr "**工作流程**"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:40
msgid "The engine first checks for prefix hits in the on-chip memory cache."
msgstr "引擎首先检查片上内存缓存中的前缀命中情况。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:42
msgid ""
"After getting the number of hit tokens on on-chip memory, it queries the "
"KV Pool via the connector. If there are additional hits in the KV Pool, "
"we get the **additional blocks only** from the KV Pool, and get the rest "
"of the blocks directly from on-chip memory to minimize the data transfer "
"latency."
msgstr ""
"获取片上内存上的命中令牌数量后,引擎通过连接器查询 KV 池。如果在 KV 池中有额外的命中,我们**仅从 KV "
"池获取额外的块**,其余块则直接从片上内存获取,以最小化数据传输延迟。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:44
msgid ""
"After the KV Caches in the KV Pool are loaded into on-chip memory, the "
"remaining process is the same as Prefix Caching in on-chip memory."
msgstr "将 KV 池中的 KV 缓存加载到片上内存后,剩余过程与片上内存中的前缀缓存相同。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:46
msgid "2. Combining KV Cache Pool with Mooncake PD Disaggregation"
msgstr "2. 将 KV 缓存池与 Mooncake PD 解耦结合"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:48
msgid ""
"When used together with Mooncake PD (Prefill-Decode) Disaggregation, the "
"KV Cache Pool can further decouple prefill and decode stages across "
"devices or nodes."
msgstr "当与 Mooncake PD预填充-解码解耦功能结合使用时KV 缓存池可以进一步在设备或节点间解耦预填充和解码阶段。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:50
msgid ""
"Currently, we only perform put and get operations of KV Pool for "
"**Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P "
"KV Connector, i.e., MooncakeConnector."
msgstr ""
"目前,我们仅对**预填充节点**执行 KV 池的 put 和 get 操作,解码节点则通过 Mooncake P2P KV 连接器(即 "
"MooncakeConnector获取其 KV 缓存。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:52
msgid ""
"The key benefit of doing this is that we can keep the gain in performance"
" by computing less with Prefix Caching from on-chip memory and KV Pool "
"for Prefill Nodes, while not sacrificing the data transfer efficiency "
"between Prefill and Decode nodes with P2P KV Connector that transfers KV "
"Caches between NPU devices directly."
msgstr ""
"这样做的主要好处是,我们可以通过为预填充节点使用来自片上内存和 KV "
"池的前缀缓存来减少计算量,从而保持性能增益,同时又不牺牲预填充节点与解码节点之间的数据传输效率,因为 P2P KV 连接器直接在 NPU "
"设备间传输 KV 缓存。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:54
msgid ""
"To enable this feature, we need to set up both Mooncake Connector and "
"MooncakeStore Connector with a Multi Connector, which is a KV Connector "
"class provided by vLLM that can call multiple KV Connectors in a specific"
" order."
msgstr ""
"要启用此功能,我们需要使用 Multi Connector 来设置 Mooncake Connector 和 MooncakeStore "
"Connector。Multi Connector 是 vLLM 提供的一个 KV 连接器类,可以按特定顺序调用多个 KV 连接器。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:56
msgid ""
"For details, please also refer to the Mooncake Connector Store Deployment"
" Guide."
msgstr "详情请参阅 Mooncake Connector Store 部署指南。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:58
msgid "How is MooncakeStoreConnectorV1 Implemented?"
msgstr "MooncakeStoreConnectorV1 是如何实现的?"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:60
msgid ""
"**MooncakeStoreConnectorV1** inherits the KV Connector V1 class in vLLM "
"V1: through implementing the required methods defined in the KV connector"
" V1 base class, one can integrate a third-party KV cache transfer/storage"
" backend into the vLLM framework."
msgstr ""
"**MooncakeStoreConnectorV1** 继承自 vLLM V1 中的 KV Connector V1 类:通过实现 KV 连接器"
" V1 基类中定义的必要方法,可以将第三方 KV 缓存传输/存储后端集成到 vLLM 框架中。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:62
msgid ""
"MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 "
"in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV "
"cache keys, and the `ChunkedTokenDatabase` class for processing tokens "
"into prefix-aware hashes as well as other hashing related designs. On top"
" of this, we have also added our own design including `KVTransferThread` "
"that allows async `get` and `put` of KV caches with multi-threading, and "
"NPU-related data transfer optimization such as removing the `LocalBuffer`"
" in LMCache to remove redundant data transfer."
msgstr ""
"MooncakeStoreConnectorV1 也在很大程度上借鉴了 LMCacheConnectorV1包括用于查找 KV 缓存键的 "
"`Lookup Engine`/`Lookup Client` 设计,以及用于将令牌处理为前缀感知哈希的 "
"`ChunkedTokenDatabase` 类和其他哈希相关设计。在此基础上,我们还添加了自己的设计,包括允许通过多线程异步 `get` 和 "
"`put` KV 缓存的 `KVTransferThread`,以及与 NPU 相关的数据传输优化,例如移除 LMCache 中的 "
"`LocalBuffer` 以消除冗余数据传输。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:64
msgid ""
"The KV Connector methods that need to be implemented can be categorized "
"into scheduler-side methods that are called in V1 scheduler and worker-"
"side methods that are called in V1 worker, namely:"
msgstr "需要实现的 KV 连接器方法可以分为在 V1 调度器中调用的调度器端方法和在 V1 工作器中调用的工作器端方法,即:"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:66
msgid "KV Connector Scheduler-Side Methods"
msgstr "KV 连接器调度器端方法"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:68
msgid ""
"`get_num_new_matched_tokens`: Get prefix cache hit in number of tokens "
"through looking up into the KV pool. `update_states_after_alloc`: "
"Update KVConnector state after temporary buffer alloc. "
"`build_connector_meta`: Attach the connector metadata to the request "
"object. `request_finished`: Once a request is finished, determine "
"whether request blocks should be freed now or will be sent asynchronously"
" and freed later."
msgstr ""
"`get_num_new_matched_tokens`:通过查询 KV 池,获取以令牌数表示的前缀缓存命中数。\n"
"`update_states_after_alloc`:临时缓冲区分配后更新 KVConnector 状态。\n"
"`build_connector_meta`:将连接器元数据附加到请求对象。\n"
"`request_finished`:请求完成后,确定请求块是应立即释放,还是将异步发送并稍后释放。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:73
msgid "Connector Worker-Side Methods"
msgstr "连接器工作器端方法"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:75
msgid ""
"`register_kv_caches`: Register KV cache buffers needed for KV cache "
"transfer. `start_load_kv`: Perform KV cache load operation that transfers"
" KV cache from storage to device. `wait_for_layer_load`: Optional; Wait "
"for layer load in layerwise + async KV load scenario. `save_kv_layer`: "
"Optional; Do layerwise KV cache put into KV Pool. `wait_for_save`: Wait "
"for KV Save to finish if async KV cache save/put. `get_finished`: Get "
"request that finished KV transfer, `done_sending` if `put` finished, "
"`done_receiving` if `get` finished."
msgstr ""
"`register_kv_caches`:注册 KV 缓存传输所需的 KV 缓存缓冲区。\n"
"`start_load_kv`:执行 KV 缓存加载操作,将 KV 缓存从存储传输到设备。\n"
"`wait_for_layer_load`:可选;在分层 + 异步 KV 加载场景中等待层加载。\n"
"`save_kv_layer`:可选;执行分层 KV 缓存放入 KV 池的操作。\n"
"`wait_for_save`:如果异步保存/放入 KV 缓存,则等待 KV 保存完成。\n"
"`get_finished`:获取已完成 KV 传输的请求,如果 `put` 完成则为 `done_sending`,如果 `get` 完成则为 "
"`done_receiving`。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:82
msgid "DFX"
msgstr "DFX(可诊断性、可维护性、可服务性)"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:84
msgid ""
"When looking up a key in KV Pool, if we cannot find the key, there is no "
"Cache Hit for this specific block; we return no hit for this block and do"
" not look up further blocks for the current request."
msgstr "在 KV 池中查找键时,如果找不到该键,则此特定块没有缓存命中;我们返回此块未命中,并且不再为当前请求查找后续块。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:85
msgid ""
"Similarly, when we are trying to put a block into KV Pool and it fails, "
"we do not put further blocks (subject to change)."
msgstr "类似地,当我们尝试将一个块放入 KV 池但失败时,我们不会放入后续块(可能更改)。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:87
msgid "Limitations"
msgstr "限制"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:89
msgid ""
"Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the "
"storage for KV Cache pool."
msgstr ""
"目前,vLLM-Ascend 的 MooncakeStore 仅支持 DRAM 作为 KV 缓存池的存储介质。"
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:91
msgid ""
"For now, if we successfully looked up a key and found it exists, but "
"failed to get it when calling KV Pool's get function, we just output a "
"log indicating the get operation failed and keep going; hence, the "
"accuracy of that specific request may be affected. We will handle this "
"situation by falling back the request and re-compute everything assuming "
"there's no prefix cache hit (or even better, revert only one block and "
"keep using the Prefix Caches before that)."
msgstr ""
"目前,如果我们成功查找到一个键并确认其存在,但在调用 KV 池的 get 函数时获取失败,我们仅输出一条日志表明 get "
"操作失败并继续执行;因此,该特定请求的准确性可能会受到影响。我们将通过回退该请求并假设没有前缀缓存命中来重新计算所有内容(或者更优的方案是,仅回退一个块并继续使用该块之前的前缀缓存)来处理这种情况。"

View File

@@ -1,665 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:1
msgid "Prepare inputs for model forwarding"
msgstr "为模型前向传播准备输入"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:3
msgid "Purpose"
msgstr "目的"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:5
msgid "Information required to perform model forward pass:"
msgstr "执行模型前向传播所需的信息:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:7
msgid "the inputs"
msgstr "输入"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:8
msgid "the corresponding attention metadata of the inputs"
msgstr "输入对应的注意力元数据"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:10
msgid "The following diagram shows what we should prepare for model inference."
msgstr "下图展示了我们需要为模型推理准备的内容。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:20
msgid ""
"Therefore, as long as we have these two pieces of information mentioned "
"above, we can perform the model's forward propagation."
msgstr "因此,只要我们拥有上述两方面的信息,就可以执行模型的前向传播。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:22
msgid ""
"This document will explain **how we obtain the inputs and their "
"corresponding attention metadata**."
msgstr "本文将解释**我们如何获取输入及其对应的注意力元数据**。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:24
msgid "Overview"
msgstr "概述"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:26
msgid "1. Obtain inputs"
msgstr "1. 获取输入"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:28
msgid "The workflow of obtaining inputs:"
msgstr "获取输入的工作流程:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:30
msgid ""
"Get `token positions`: relative position of each token within its request"
" sequence."
msgstr "获取 `token positions`:每个 token 在其请求序列中的相对位置。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:32
msgid "Get `token indices`: index of each scheduled token in the token table."
msgstr "获取 `token indices`:每个已调度 token 在 token 表中的索引。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:34
msgid ""
"Get `Token IDs`: using token indices to retrieve the Token IDs from "
"**token id table**."
msgstr "获取 `Token IDs`:使用 token indices 从 **token id table** 中检索 Token IDs。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:36
msgid ""
"At last, these `Token IDs` are required to be fed into a model, and "
"`positions` should also be sent into the model to create `Rope` (Rotary "
"positional embedding). Both of them are the inputs of the model."
msgstr ""
"最后,这些 `Token IDs` 需要输入到模型中,`positions` 也需要送入模型以创建 "
"`Rope`(旋转位置编码)。两者共同构成模型的输入。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:38
msgid ""
"**Note**: The `Token IDs` are the inputs of a model, so we also call them"
" `Input IDs`."
msgstr "**注意**`Token IDs` 是模型的输入,因此我们也称它们为 `Input IDs`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:40
msgid "2. Build inputs attention metadata"
msgstr "2. 构建输入注意力元数据"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:42
msgid "A model requires these attention metadata during the forward pass:"
msgstr "模型在前向传播过程中需要以下注意力元数据:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:44
msgid ""
"`query start location`: start and end location of each request "
"corresponding to the scheduled tokens."
msgstr "`query start location`:每个请求对应的已调度 token 的起始和结束位置。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:45
msgid ""
"`sequence length`: length of each request including both computed tokens "
"and newly scheduled tokens."
msgstr "`sequence length`:每个请求的长度,包括已计算 token 和新调度的 token。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:46
msgid "`number of computed tokens`: number of computed tokens for each request."
msgstr "`number of computed tokens`:每个请求已计算 token 的数量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:47
msgid "`number of requests`: number of requests in this batch."
msgstr "`number of requests`:本批次中的请求数量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:48
msgid "`number of tokens`: total number of scheduled tokens in this batch."
msgstr "`number of tokens`:本批次中已调度 token 的总数。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:49
msgid ""
"**`block table`**: translates the logical address (within its sequence) "
"of each block to its global physical address in the device's memory."
msgstr "**`block table`**:将每个块在其序列内的逻辑地址转换为其在设备内存中的全局物理地址。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:50
msgid ""
"`max query len`: the longest scheduled tokens length in this request "
"batch."
msgstr "`max query len`:本请求批次中最长的已调度 token 长度。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:51
msgid ""
"`slot mapping`: indices of the KV cache slots that each input token will "
"be stored into."
msgstr "`slot mapping`:每个输入 token 将被存入的 KV 缓存槽位的索引。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:52
msgid ""
"`attention mask`: mask matrix applied to attention scores before softmax "
"to control which tokens can attend to each other (usually a causal "
"attention)."
msgstr "`attention mask`:在 softmax 之前应用于注意力分数的掩码矩阵,用于控制哪些 token 可以相互关注(通常是因果注意力)。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:54
msgid "Before start"
msgstr "开始之前"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:56
msgid "There are mainly three types of variables."
msgstr "主要有三种类型的变量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:58
msgid ""
"token level: represents one attribute corresponding to each scheduled "
"token, so the length of this variable is the number of scheduled tokens."
msgstr "token 级别:代表每个已调度 token 对应的一个属性,因此该变量的长度等于已调度 token 的数量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:59
msgid ""
"request level: represents one attribute of each scheduled request, whose "
"length usually is the number of scheduled requests. (`query start "
"location` is a special case, which has one more element.)"
msgstr "请求级别:代表每个已调度请求的一个属性,其长度通常等于已调度请求的数量。(`query start location` 是一个特例,它多一个元素。)"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:60
msgid "system level:"
msgstr "系统级别:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:61
msgid ""
"**Token IDs table**: stores the token IDs (i.e. the inputs of a model) of"
" each request. The shape of this table is `(max num request, max model "
"len)`. Here, `max num request` is the maximum count of concurrent "
"requests allowed in a forward batch and `max model len` is the maximum "
"token count that can be handled at one request sequence in this model."
msgstr ""
"**Token IDs table**:存储每个请求的 token IDs即模型的输入。此表的形状为 `(max num request, "
"max model len)`。其中,`max num request` 是前向批次中允许的最大并发请求数,`max model len` "
"是该模型中单个请求序列可以处理的最大 token 数量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:62
msgid ""
"**Block table**: translates the logical address (within its sequence) of "
"each block to its global physical address in the device's memory. The "
"shape of this table is `(max num request, max model len / block size)`"
msgstr ""
"**Block table**:将每个块在其序列内的逻辑地址转换为其在设备内存中的全局物理地址。此表的形状为 `(max num request,"
" max model len / block size)`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:64
msgid ""
"**Note**: Both of these two tables come from the `_update_states` method "
"before **preparing inputs**. You can take a look if you need more "
"inspiration."
msgstr "**注意**:这两个表都来自 **准备输入** 之前的 `_update_states` 方法。如果需要更多启发,可以查看一下。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:66
msgid "Tips"
msgstr "提示"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:68
msgid ""
"Simply put, a `token ID` is an **integer** (usually `int32`), which "
"represents a token. Example of `Token ID`:"
msgstr "简而言之,一个 `token ID` 是一个**整数**(通常是 `int32`),它代表一个 token。`Token ID` 示例:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:88
msgid "Go through details"
msgstr "深入细节"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:90
msgid "Assumptions:"
msgstr "假设:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:92
msgid "maximum number of tokens that can be scheduled at once: 10"
msgstr "一次可调度的最大 token 数:10"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:93
msgid "`block size`: 2"
msgstr "`block size`:2"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:94
msgid ""
"Totally schedule 3 requests. Their prompt lengths are 3, 2, and 8 "
"respectively."
msgstr "总共调度 3 个请求。它们的提示长度分别为 3、2 和 8。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:95
msgid ""
"`max model length`: 12 (the maximum token count that can be handled at "
"one request sequence in a model)."
msgstr "`max model length`:12(模型中单个请求序列可以处理的最大 token 数量)。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:97
msgid ""
"These assumptions are configured at the beginning when starting vLLM. "
"They are not fixed, so you can manually set them."
msgstr "这些假设是在启动 vLLM 时配置的。它们不是固定的,因此可以手动设置。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:99
msgid "Step 1: All requests in the prefill phase"
msgstr "步骤 1:所有请求均处于预填充阶段"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:101
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:213
msgid "Obtain inputs"
msgstr "获取输入"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:103
msgid ""
"As the maximum number of tokens that can be scheduled is 10, the "
"scheduled tokens of each request can be represented as `{'0': 3, '1': 2, "
"'2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt "
"tokens unscheduled."
msgstr ""
"由于一次可调度的最大 token 数为 10每个请求的已调度 token 可以表示为 `{'0': 3, '1': 2, '2': 5}`。注意"
" `request_2` 使用了分块预填充,留下了 3 个提示 token 未调度。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:105
msgid "1. Get token positions"
msgstr "1. 获取 token positions"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:107
msgid ""
"First, determine which request each token belongs to: tokens 02 are "
"assigned to **request_0**, tokens 34 to **request_1**, and tokens 59 to"
" **request_2**. To represent this mapping, we use `request indices`, for "
"example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`."
msgstr ""
"首先,确定每个 token 属于哪个请求token 02 分配给 **request_0**token 34 分配给 "
"**request_1**token 59 分配给 **request_2**。为了表示这种映射,我们使用 `request "
"indices`,例如,`request indices``[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:109
msgid ""
"For each request, use **the number of computed tokens** + **the relative "
"position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + "
"2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`)"
" and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)."
msgstr ""
"对于每个请求,使用 **已计算 token 的数量** + **当前调度 token 的相对位置**`request_0: [0 + 0, 0 "
"+ 1, 0 + 2]``request_1: [0 + 0, 0 + 1]``request_2: [0 + 0, 0 + 1,..., 0"
" + 4]`),然后将它们连接在一起(`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:111
msgid ""
"Note: there is a more efficient way (using `request indices`) to create "
"positions in actual code."
msgstr "注意:在实际代码中,有一种更高效的方法(使用 `request indices`)来创建 positions。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:113
msgid ""
"Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, "
"3, 4]`. This variable is **token level**."
msgstr ""
"最后,`token positions` 可以获取为 `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`。此变量是 **token "
"级别** 的。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:115
msgid "2. Get token indices"
msgstr "2. 获取 token indices"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:117
msgid ""
"The shape of the current **Token IDs table** is `(max num request, max "
"model len)`."
msgstr "当前 **Token IDs table** 的形状为 `(max num request, max model len)`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:119
msgid ""
"Why are these `T_3_5`, `T_3_6`, `T_3_7` in this table without being "
"scheduled?"
msgstr "为什么表中的 `T_3_5`、`T_3_6`、`T_3_7` 没有被调度?"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:121
msgid ""
"We fill all Token IDs in one request sequence to this table at once, but "
"we only retrieve the tokens we scheduled this time. Then we retrieve the "
"remaining Token IDs next time."
msgstr "我们将一个请求序列中的所有 Token IDs 一次性填充到此表中,但我们只检索本次调度的 token。然后下次再检索剩余的 Token IDs。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:133
msgid "Note that `T_x_x` is an `int32`."
msgstr "注意 `T_x_x` 是一个 `int32`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:135
msgid ""
"Let's say `M = max model len`. Then we can use `token positions` together"
" with `request indices` of each token to construct `token indices`."
msgstr ""
"假设 `M = max model len`。那么我们可以使用 `token positions` 以及每个 token 的 `request "
"indices` 来构造 `token indices`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:137
msgid ""
"So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 "
"* M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2,"
" 12, 13, 24, 25, 26, 27, 28]`"
msgstr ""
"所以 `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 "
"* M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2,"
" 12, 13, 24, 25, 26, 27, 28]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:139
msgid "3. Retrieve the Token IDs"
msgstr "3. 检索 Token IDs"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:141
msgid ""
"We use `token indices` to select out the corresponding `Input IDs` from "
"the token table. The pseudocode is as follows:"
msgstr "我们使用 `token indices` 从 token 表中选择出对应的 `Input IDs`。伪代码如下:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:147
msgid "As mentioned before, we refer to these `Token IDs` as `Input IDs`."
msgstr "如前所述,我们将这些 `Token IDs` 称为 `Input IDs`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:149
msgid ""
"`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, "
"T_3_3, T_3_4]`"
msgstr ""
"`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, "
"T_3_3, T_3_4]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:151
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:237
msgid "Build inputs attention metadata"
msgstr "构建输入注意力元数据"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:153
msgid ""
"In the current **Block Table**, we use the first block (i.e. block_0) to "
"mark the unused block. The shape of the block is `(max num request, max "
"model len / block size)`, where `max model len / block size = 12 / 2 = "
"6`."
msgstr ""
"在当前的**块表**中,我们使用第一个块(即 block_0来标记未使用的块。块的形状为 `(最大请求数, 最大模型长度 / 块大小)`,其中 "
"`最大模型长度 / 块大小 = 12 / 2 = 6`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:165
msgid "The KV cache block in the device memory is like:"
msgstr "设备内存中的 KV 缓存块如下所示:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:171
msgid ""
"Let's say `K = max model len / block size = 6`, and we can get token "
"`device block number`."
msgstr "假设 `K = 最大模型长度 / 块大小 = 6`,我们可以得到令牌的`设备块编号`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:173
msgid "The workflow of achieving slot mapping:"
msgstr "实现槽映射的工作流程:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:175
msgid "Get `block table indices` using `K`, `positions` and `request indices`."
msgstr "使用 `K`、`positions` 和 `request indices` 获取`块表索引`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:177
msgid ""
"Purpose: For each token, it could be used to select `device block number`"
" from `block table`."
msgstr "目的:对于每个令牌,它可用于从`块表`中选择`设备块编号`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:179
msgid "Get `device block number` using `block table indices`."
msgstr "使用`块表索引`获取`设备块编号`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:181
msgid ""
"Purpose: `device block number` indicates which device block each token "
"belongs to."
msgstr "目的:`设备块编号`指示每个令牌属于哪个设备块。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:183
msgid "Get `block offsets` using `positions` and `block size`."
msgstr "使用 `positions` 和 `block size` 获取`块内偏移`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:185
msgid ""
"Purpose: `block offsets` indicates the offsets of each token within a "
"block."
msgstr "目的:`块内偏移`指示每个令牌在块内的偏移量。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:187
msgid "construct `slot mapping` using `device block number` and `block offsets`."
msgstr "使用`设备块编号`和`块内偏移`构建`槽映射`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:189
msgid "Purpose: we can use `slot mapping` to store Token IDs into token slots."
msgstr "目的:我们可以使用`槽映射`将令牌 ID 存储到令牌槽中。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:191
msgid "Details:"
msgstr "详细信息:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:193
msgid ""
"(**Token level**) Use a simple formula to calculate `block table "
"indices`: `request indices * K + positions / block size`. So it equals "
"`[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 /"
" 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4"
" / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to "
"select `device block number` from `block table`."
msgstr ""
"(**令牌级别**) 使用一个简单的公式计算`块表索引``request indices * K + positions / block "
"size`。因此它等于 `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2,"
" 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / "
"2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, "
"14]`。这可用于从`块表`中选择`设备块编号`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:194
msgid ""
"(**Token level**) Use `block table indices` to select out `device block "
"number` for each scheduled token. The pseudocode is `block_numbers = "
"block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, "
"3, 4, 4, 5, 5, 6]`"
msgstr ""
"(**令牌级别**) 使用`块表索引`为每个已调度的令牌选择出`设备块编号`。伪代码为 `block_numbers = "
"block_table[block_table_indices]`。因此 `设备块编号=[1, 1, 2, 3, 3, 4, 4, 5, 5, "
"6]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:195
msgid ""
"(**Token level**) `block offsets` could be computed by `block offsets = "
"positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`."
msgstr ""
"(**令牌级别**) `块内偏移`可以通过 `block offsets = positions % block size = [0, 1, 0,"
" 0, 1, 0, 1, 0, 1, 0]` 计算得出。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:196
msgid ""
"Finally, use `block offsets` and `device block number` to create `slot "
"mapping`: `device block number * block size + block_offsets = [2, 3, 4, "
"6, 7, 8, 9, 10, 11, 12]`"
msgstr ""
"最后,使用`块内偏移`和`设备块编号`创建`槽映射``设备块编号 * 块大小 + 块内偏移 = [2, 3, 4, 6, 7, 8, 9, "
"10, 11, 12]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:198
msgid "(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:"
msgstr "(**请求级别**) 已知已调度的令牌数量为 `[3, 2, 5]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:200
msgid ""
"(**Request level**) Use prefix sum to calculate `query start location`: "
"`[0, 3, 5, 10]`."
msgstr "(**请求级别**) 使用前缀和计算`查询起始位置``[0, 3, 5, 10]`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:201
msgid ""
"(**Request level**) All tokens in step 1 are in the prefill stage, and "
"the computed tokens count is 0; then `sequence length` = `[3, 2, 5]`."
msgstr "(**请求级别**) 步骤 1 中的所有令牌都处于预填充阶段,已计算的令牌数量为 0因此 `序列长度` = `[3, 2, 5]`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:202
msgid ""
"(**Request level**) As mentioned above, `number of computed tokens` are "
"all 0s: `[0, 0, 0]`."
msgstr "(**请求级别**) 如上所述,`已计算令牌数`均为 0`[0, 0, 0]`。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:203
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:272
msgid "`number of requests`: `3`"
msgstr "`请求数量``3`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:204
msgid "(**Request level**) `number of tokens`: `[3, 2, 5]`"
msgstr "(**请求级别**) `令牌数量``[3, 2, 5]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:205
msgid "`max query len`: `5`"
msgstr "`最大查询长度``5`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:206
msgid "(**Token level**) `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`"
msgstr "(**令牌级别**) `槽映射``[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:207
msgid ""
"`attention mask`: For all requests that initiate a prefill process, we "
"simply create only one mask matrix for reuse across different requests. "
"The shape of this mask matrix is `5 * 5`:"
msgstr "`注意力掩码`:对于所有发起预填充过程的请求,我们仅创建一个掩码矩阵,以便在不同请求间复用。该掩码矩阵的形状为 `5 * 5`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:209
msgid "Step 2: Chunked prefill"
msgstr "步骤 2分块预填充"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:211
msgid ""
"In Step 2, we no longer provide explanations or perform calculations; "
"instead, we directly present the final result."
msgstr "在步骤 2 中,我们不再提供解释或进行计算;而是直接呈现最终结果。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:215
#, python-brace-format
msgid "Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`"
msgstr "每个请求的已调度令牌:`{'0': 1, '1': 1, '2': 3}`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:217
msgid "`request indices`: `[0, 1, 2, 2, 2]`"
msgstr "`请求索引``[0, 1, 2, 2, 2]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:218
msgid "`token positions`: `[3, 2, 5, 6, 7]`"
msgstr "`令牌位置``[3, 2, 5, 6, 7]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:220
msgid "Current **Token IDs table**:"
msgstr "当前**令牌 ID 表**"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:232
msgid ""
"**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and "
"**request_1** respectively. They are sampled from the output of the "
"model."
msgstr ""
"**注意****T_0_3**、**T_1_2** 分别是 **request_0** 和 **request_1** 的新令牌 "
"ID。它们是从模型输出中采样得到的。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:234
msgid "`token indices`: `[3, 14, 29, 30, 31]`"
msgstr "`令牌索引``[3, 14, 29, 30, 31]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:235
msgid "`Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`"
msgstr "`输入 ID``[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:239
msgid ""
"We allocate the blocks `7` and `8` to `request_1` and `request_2` "
"respectively, as they need more space in device to store KV cache "
"following token generation or chunked prefill."
msgstr ""
"我们将块 `7` 和 `8` 分别分配给 `request_1` 和 "
"`request_2`,因为它们在令牌生成或分块预填充后需要更多设备空间来存储 KV 缓存。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:241
msgid "Current **Block Table**:"
msgstr "当前**块表**"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:253
msgid "KV cache block in the device memory:"
msgstr "设备内存中的 KV 缓存块:"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:259
msgid "(**Token level**) `block table indices`: `[1, 7, 14, 15, 15]`"
msgstr "(**令牌级别**) `块表索引``[1, 7, 14, 15, 15]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:260
msgid "(**Token level**) `device block number`: `[2, 7, 6, 8, 8]`"
msgstr "(**令牌级别**) `设备块编号``[2, 7, 6, 8, 8]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:261
msgid "(**Token level**) `block offsets`: `[1, 0, 1, 0, 1]`"
msgstr "(**令牌级别**) `块内偏移``[1, 0, 1, 0, 1]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:262
msgid "(**Token level**) `slot mapping`: `[5, 14, 13, 16, 17]`"
msgstr "(**令牌级别**) `槽映射``[5, 14, 13, 16, 17]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:264
msgid "Scheduled token count: `[1, 1, 3]`"
msgstr "已调度令牌数量:`[1, 1, 3]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:266
msgid "`query start location`: `[0, 1, 2, 5]`"
msgstr "`查询起始位置``[0, 1, 2, 5]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:268
msgid "`sequence length`: `[4, 3, 8]`"
msgstr "`序列长度``[4, 3, 8]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:270
msgid "`number of computed tokens`: `[3, 2, 5]`"
msgstr "`已计算令牌数``[3, 2, 5]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:274
msgid "`max query len`: `3`"
msgstr "`最大查询长度``3`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:276
msgid "`slot mapping`: `[5, 14, 13, 16, 17]`"
msgstr "`槽映射``[5, 14, 13, 16, 17]`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:278
msgid "`attention mask`: `5 * 8`"
msgstr "`注意力掩码``5 * 8`"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:280
msgid "Each token has a `1 * 8` vector, and there are 5 scheduled tokens."
msgstr "每个令牌有一个 `1 * 8` 的向量,共有 5 个已调度的令牌。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:282
msgid "At last"
msgstr "最后"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:284
msgid ""
"If you understand step 1 and step 2, you will know all the following "
"steps."
msgstr "如果您理解了步骤 1 和步骤 2您就会知道所有后续步骤。"
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:286
msgid ""
"Hope this document helps you better understand how vLLM prepares inputs "
"for model forwarding. If you have any good ideas, you are welcome to "
"contribute to us."
msgstr "希望本文档能帮助您更好地理解 vLLM 如何为模型前向传播准备输入。如果您有任何好的想法,欢迎向我们贡献。"

View File

@@ -1,84 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:1
msgid "Adding a custom aclnn operation"
msgstr "添加自定义 aclnn 算子"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:3
msgid ""
"This document describes how to add a custom aclnn operation to vllm-"
"ascend."
msgstr "本文档描述了如何向 vllm-ascend 添加自定义 aclnn 算子。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:5
msgid "How custom aclnn operation works in vllm-ascend?"
msgstr "自定义 aclnn 算子在 vllm-ascend 中如何工作?"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:7
msgid ""
"Custom aclnn operations are built and installed into "
"`vllm_ascend/cann_ops_custom` directory during the build process of vllm-"
"ascend. Then the aclnn operators are bound to `torch.ops._C_ascend` "
"module, enabling users to invoke them in vllm-ascend python code."
msgstr "自定义 aclnn 算子在 vllm-ascend 的构建过程中被编译并安装到 `vllm_ascend/cann_ops_custom` 目录。然后,这些 aclnn 算子被绑定到 `torch.ops._C_ascend` 模块,使用户能够在 vllm-ascend 的 Python 代码中调用它们。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:9
msgid "To enable custom operations, use the following code:"
msgstr "要启用自定义算子,请使用以下代码:"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:17
msgid "How to add a custom aclnn operation?"
msgstr "如何添加自定义 aclnn 算子?"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:19
msgid "Create a new operation folder under `csrc` directory."
msgstr "在 `csrc` 目录下创建一个新的算子文件夹。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:20
msgid ""
"Create `op_host` and `op_kernel` directories for host and kernel source "
"code."
msgstr "为宿主端和内核源代码创建 `op_host` 和 `op_kernel` 目录。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:21
msgid ""
"Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that "
"multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`."
msgstr "在 `csrc/build_aclnn.sh` 中为支持的 SOC 添加构建选项。注意多个算子应用 `;` 分隔,例如 `CUSTOM_OPS=op1;op2;op3`。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:22
msgid ""
"Bind aclnn operators to torch.ops._C_ascend module in "
"`csrc/torch_binding.cpp`."
msgstr "在 `csrc/torch_binding.cpp` 中将 aclnn 算子绑定到 torch.ops._C_ascend 模块。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:23
msgid ""
"Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op "
"to be captured into the aclgraph."
msgstr "在 `csrc/torch_binding_meta.cpp` 中为算子编写一个元实现,以便其能被捕获到 aclgraph 中。"
#: ../../source/developer_guide/Design_Documents/add_custom_aclnn_op.md:25
msgid ""
"After a successful build of vllm-ascend, the custom aclnn operation can "
"be invoked in python code."
msgstr "成功构建 vllm-ascend 后,即可在 Python 代码中调用自定义的 aclnn 算子。"

View File

@@ -1,391 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:1
msgid "Context Parallel (CP)"
msgstr "上下文并行 (CP)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:3
msgid ""
"TL;DR PCP accelerates prefill via sequence splitting. DCP eliminates KV "
"cache redundancy."
msgstr "TL;DR PCP 通过序列分割加速预填充。DCP 消除 KV 缓存冗余。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:5
msgid "![ContextParallel](../../assets/cp/overview.png)"
msgstr "![ContextParallel](../../assets/cp/overview.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:5
msgid "ContextParallel"
msgstr "ContextParallel"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:7
msgid ""
"For the main discussions during the development process, please refer to "
"the [RFC](https://github.com/vllm-project/vllm/issues/25749) and the "
"relevant links referenced by or referencing this RFC."
msgstr "关于开发过程中的主要讨论,请参阅 [RFC](https://github.com/vllm-project/vllm/issues/25749) 以及该 RFC 引用或被引用的相关链接。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:9
msgid "What is CP?"
msgstr "什么是 CP"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:11
msgid ""
"**Context Parallel (CP)** is a strategy for parallelizing computation "
"along the sequence dimension across multiple devices."
msgstr "**上下文并行 (CP)** 是一种沿序列维度在多个设备间并行计算的策略。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:13
msgid ""
"**Prefill Context Parallel (PCP)** expands the world size of devices and "
"uses dedicated communication domains. Its primary goal is to partition "
"the sequence dimension during the prefill phase, enabling different "
"devices to compute distinct chunks of the sequence simultaneously. The KV"
" cache is sharded along the sequence dimension across devices. This "
"approach impacts the computational logic of both the Prefill and Decode "
"stages to varying degrees."
msgstr "**预填充上下文并行 (PCP)** 扩展了设备的世界大小并使用专用的通信域。其主要目标是在预填充阶段对序列维度进行分区使不同设备能同时计算序列的不同分块。KV 缓存沿序列维度跨设备分片。此方法在不同程度上影响了预填充和解码阶段的计算逻辑。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:18
msgid ""
"**Decode Context Parallel (DCP)** reuses the communication domain of "
"Tensor Parallelism (TP) and does not require additional devices. Its main"
" objective is to eliminate duplicated storage of the KV cache by sharding"
" it along the sequence dimension across devices within the TP domain that"
" would otherwise hold redundant copies. DCP primarily influences the "
"Decode logic, as well as the logic for chunked prefill and cached "
"prefill."
msgstr "**解码上下文并行 (DCP)** 复用张量并行 (TP) 的通信域,且不需要额外的设备。其主要目标是通过在 TP 域内沿序列维度对 KV 缓存进行分片消除原本会存储冗余副本的设备间的重复存储。DCP 主要影响解码逻辑,以及分块预填充和缓存预填充的逻辑。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:22
msgid "How to Use CP?"
msgstr "如何使用 CP"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:24
msgid ""
"Please refer to the [context parallel user "
"guide](../../user_guide/feature_guide/context_parallel.md) for detailed "
"information."
msgstr "详细信息请参阅 [上下文并行用户指南](../../user_guide/feature_guide/context_parallel.md)。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:26
msgid "How It Works?"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:28
msgid "Device Distribution"
msgstr "设备分布"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:30
msgid ""
"We introduce new communication domains for PCP and reuse TP for DCP, and "
"this is the new layout of devices for PCP2, DCP2, and TP4. "
"![device_world](../../assets/cp/device_world.png)"
msgstr "我们为 PCP 引入了新的通信域,并为 DCP 复用了 TP 的通信域,这是 PCP2、DCP2 和 TP4 的新设备布局。![device_world](../../assets/cp/device_world.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:30
msgid "device_world"
msgstr "device_world"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:33
msgid "Block Table"
msgstr "块表"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:35
msgid ""
"CP performs sequence sharding on the KV cache storage. To facilitate "
"efficient storage and access, tokens are stored in an interleaved manner "
"across devices, with the interleaving granularity determined by "
"`cp_kv_cache_interleave_size`, whose default value is "
"`cp_kv_cache_interleave_size=1`, a.k.a. 'token interleave'."
msgstr "CP 对 KV 缓存存储执行序列分片。为了便于高效存储和访问,令牌以交错方式跨设备存储,交错粒度由 `cp_kv_cache_interleave_size` 决定,其默认值为 `cp_kv_cache_interleave_size=1`,也称为“令牌交错”。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:37
msgid ""
"Given that PCP and DCP behave similarly for KV cache sharding, we refer "
"to them collectively as CP. Specifically, `cp_size = pcp_size * "
"dcp_size`, and `cp_rank = pcp_rank * dcp_size + dcp_rank`."
msgstr "鉴于 PCP 和 DCP 在 KV 缓存分片方面的行为相似,我们将它们统称为 CP。具体来说`cp_size = pcp_size * dcp_size`,且 `cp_rank = pcp_rank * dcp_size + dcp_rank`。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:39
msgid ""
"As illustrated, a virtual block is defined in the block table, where "
"blocks within the same CP device group form a virtual block. The virtual "
"block size is `virtual_block_size = block_size * cp_size`."
msgstr "如图所示,块表中定义了一个虚拟块,同一 CP 设备组内的块构成一个虚拟块。虚拟块大小为 `virtual_block_size = block_size * cp_size`。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:41
#, python-format
msgid ""
"For any token `x`, referencing the following figure, its (virtual) block "
"index is `x // virtual_block_size`, and the offset within the virtual "
"block is `offset_within_virtual_block = x % virtual_block_size`. The "
"local block index is `local_block_index = offset_within_virtual_block // "
"cp_kv_cache_interleave_size`, and the device number is `target_rank = "
"local_block_index % cp_size`. The offset within the local block is "
"`(local_block_index // cp_size) * cp_kv_cache_interleave_size + "
"offset_within_virtual_block % cp_kv_cache_interleave_size`."
msgstr "对于任意令牌 `x`,参考下图,其(虚拟)块索引为 `x // virtual_block_size`,在虚拟块内的偏移量为 `offset_within_virtual_block = x % virtual_block_size`。本地块索引为 `local_block_index = offset_within_virtual_block // cp_kv_cache_interleave_size`,设备号为 `target_rank = local_block_index % cp_size`。在本地块内的偏移量为 `(local_block_index // cp_size) * cp_kv_cache_interleave_size + offset_within_virtual_block % cp_kv_cache_interleave_size`。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:45
msgid "![BlockTable](../../assets/cp/blocktable.png)"
msgstr "![BlockTable](../../assets/cp/blocktable.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:45
msgid "BlockTable"
msgstr "BlockTable"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:47
msgid ""
"Based on the logic above, the `slot_mapping` calculation process is "
"adjusted, and the `slot_mapping` values on each device are modified to "
"ensure the KV cache is sharded along the sequence dimension and stored "
"across different devices as expected."
msgstr "基于上述逻辑,调整了 `slot_mapping` 的计算过程,并修改了每个设备上的 `slot_mapping` 值,以确保 KV 缓存沿序列维度分片并按预期存储在不同设备上。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:49
#, python-format
msgid ""
"The current implementation requires that `block_size % "
"cp_kv_cache_interleave_size == 0`."
msgstr "当前实现要求 `block_size % cp_kv_cache_interleave_size == 0`。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:51
msgid "Decode Context Parallel (DCP)"
msgstr "解码上下文并行 (DCP)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:53
msgid ""
"As mentioned above, the primary function of DCP is to shard the KV cache "
"along the sequence dimension for storage. Its impact lies in the logic of"
" the decode and chunked prefill phases."
msgstr "如上所述DCP 的主要功能是沿序列维度对 KV 缓存进行分片存储。其影响在于解码和分块预填充阶段的逻辑。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:55
msgid ""
"**Prefill Phase:** As illustrated, during the Chunked Prefill "
"computation, two distinct logic implementations are employed for MLA and "
"GQA backends."
msgstr "**预填充阶段:** 如图所示在分块预填充计算期间MLA 和 GQA 后端采用了两种不同的逻辑实现。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:58
msgid ""
"In the **MLA backend**, a Context KV Cache `all_gather` operation is "
"performed to aggregate the full KV values. These are then used for "
"attention computation with the Q values of the current chunk. Note that "
"in multi-request scenarios, the directly gathered KV results are "
"interleaved across requests. The `reorg_kvcache` function is used to "
"reorganize the KV cache, ensuring that the KV cache of the same request "
"is stored contiguously."
msgstr "在 **MLA 后端** 中,执行上下文 KV 缓存 `all_gather` 操作以聚合完整的 KV 值。然后这些值与当前分块的 Q 值一起用于注意力计算。请注意,在多请求场景中,直接收集的 KV 结果在请求间是交错的。使用 `reorg_kvcache` 函数来重新组织 KV 缓存,确保同一请求的 KV 缓存被连续存储。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:63
msgid ""
"In the **GQA backend**, an `all_gather` is performed along the head "
"dimension for Q. This is because DCP overlaps with the TP communication "
"domain, and the Q heads within a DCP group differ. However, they need to "
"exchange results with the locally computed KV cache for online Softmax "
"updates. To ensure correctness during result updates, the Q values are "
"synchronized across the DCP group via head-dimension `all_gather`. During"
" the result update process, `cp_lse_ag_out_rs` is invoked to aggregate "
"`attn_output` and `attn_lse`, update the results, and perform a reduce-"
"scatter operation on the outputs. Alternatively, we can use an all-to-all"
" communication to exchange the output and LSE results, followed by direct"
" local updates. This approach aligns with the logic adapted for PCP "
"compatibility."
msgstr "在 **GQA 后端** 中,沿头维度对 Q 执行 `all_gather`。这是因为 DCP 与 TP 通信域重叠,且 DCP 组内的 Q 头不同。然而,它们需要与本地计算的 KV 缓存交换结果以进行在线 Softmax 更新。为确保结果更新过程中的正确性Q 值通过头维度的 `all_gather` 在 DCP 组内同步。在结果更新过程中,调用 `cp_lse_ag_out_rs` 来聚合 `attn_output` 和 `attn_lse`,更新结果,并对输出执行 reduce-scatter 操作。或者,我们可以使用 all-to-all 通信来交换输出和 LSE 结果,然后直接进行本地更新。这种方法与为 PCP 兼容性而调整的逻辑一致。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:70
msgid "![DCP-Prefill](../../assets/cp/dcp-prefill.png)"
msgstr "![DCP-Prefill](../../assets/cp/dcp-prefill.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:70
msgid "DCP-Prefill"
msgstr "DCP-Prefill"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:72
msgid ""
"**Decode Phase:** The logic during the decode phase is consistent with "
"that of GQA's chunked prefill: an all-gather operation is first performed"
" along the Q head dimension to ensure consistency within the DCP group. "
"After computing the results with the local KV cache, the results are "
"updated via the `cp_lse_ag_out_rs` function."
msgstr "**解码阶段:** 解码阶段的逻辑与 GQA 的分块预填充一致:首先沿 Q 头维度执行 all-gather 操作以确保 DCP 组内的一致性。使用本地 KV 缓存计算结果后,通过 `cp_lse_ag_out_rs` 函数更新结果。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:76
msgid "![DCP-Decode](../../assets/cp/dcp-decode.png)"
msgstr "![DCP-Decode](../../assets/cp/dcp-decode.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:76
msgid "DCP-Decode"
msgstr "DCP-Decode"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:78
msgid "Prefill Context Parallel (PCP)"
msgstr "预填充上下文并行 (PCP)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:80
msgid "**Tokens Partition in Head-Tail Style**"
msgstr "**头尾式令牌分区**"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:82
msgid ""
"PCP requires splitting the input sequence and ensuring balanced "
"computational load across devices during the prefill phase. We employ a "
"head-tail style for splitting and concatenation: specifically, the "
"sequence is first padded to a length of `2*pcp_size`, then divided into "
"`2*pcp_size` equal parts. The first part is merged with the last part, "
"the second part with the second last part, and so on, thereby assigning "
"computationally balanced chunks to each device. Additionally, since "
"allgather aggregation of KV or Q results in interleaved chunks from "
"different requests, we compute `pcp_allgather_restore_idx` to quickly "
"restore the original order."
msgstr "PCP 需要在预填充阶段分割输入序列并确保跨设备的计算负载均衡。我们采用头尾式进行分割和连接:具体来说,首先将序列填充到长度为 `2*pcp_size`,然后分成 `2*pcp_size` 个相等的部分。第一部分与最后一部分合并,第二部分与倒数第二部分合并,依此类推,从而为每个设备分配计算上均衡的分块。此外,由于 KV 或 Q 的 allgather 聚合会导致来自不同请求的交错分块,我们计算 `pcp_allgather_restore_idx` 以快速恢复原始顺序。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:87
msgid "These logics are implemented in the function `_update_tokens_for_pcp`."
msgstr "这些逻辑在函数 `_update_tokens_for_pcp` 中实现。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:89
msgid "![PCP-Partition](../../assets/cp/head-tail-style.png)"
msgstr "![PCP-Partition](../../assets/cp/head-tail-style.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:89
msgid "PCP-Partition"
msgstr "PCP-Partition"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:91
msgid "**Prefill Phase:**"
msgstr "**预填充阶段:**"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:93
msgid ""
"During the Prefill phase (excluding chunked prefill), we employ an all-"
"gather KV approach to address the issue of incomplete sequences on "
"individual GPUs. It is important to note that we only aggregate the KV "
"values for the current layer at a time, and these are discarded "
"immediately after use, avoiding excessive peak memory usage. This method "
"can also be directly applied to KV cache storage (since the KV cache "
"partitioning method differs from PCP sequence partitioning, it is "
"inevitable that each GPU requires a complete copy of the KV values). All "
"attention backends maintain consistency in this logic."
msgstr "在预填充阶段(不包括分块预填充),我们采用 all-gather KV 的方法来解决单个 GPU 上序列不完整的问题。需要注意的是,我们一次只聚合当前层的 KV 值,并且在使用后立即丢弃,以避免过高的峰值内存使用。此方法也可直接应用于 KV 缓存存储(由于 KV 缓存的分区方法与 PCP 序列分区不同,每个 GPU 都需要一份完整的 KV 值副本是不可避免的)。所有注意力后端在此逻辑上保持一致。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:98
msgid ""
"Note: While a Ring Attention approach could also facilitate information "
"exchange with lower peak memory and enable computation-communication "
"overlap, we prioritized the all-gather KV implementation after evaluating"
" that the development complexity was high and the benefits of overlap "
"were limited."
msgstr "注意:虽然环形注意力方法也能以更低的峰值内存促进信息交换并实现计算-通信重叠,但在评估了开发复杂度高且重叠收益有限后,我们优先实现了 all-gather KV 方案。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:100
msgid "![PCP-Prefill](../../assets/cp/pcp-prefill.png)"
msgstr "![PCP-Prefill](../../assets/cp/pcp-prefill.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:100
msgid "PCP-Prefill"
msgstr "PCP-Prefill"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:102
msgid "**Decode Phase:**"
msgstr "**解码阶段:**"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:104
msgid ""
"During the decode phase, we only need to add an allgather within the PCP "
"group after the DCP all-to-all communication exchanges the output and "
"LSE, before proceeding with the output update."
msgstr "在解码阶段,我们只需要在 DCP all-to-all 通信交换输出和 LSE 之后,于 PCP 组内添加一个 allgather然后再进行输出更新。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:106
msgid "![PCP-Decode](../../assets/cp/pcp-decode.png)"
msgstr "![PCP-Decode](../../assets/cp/pcp-decode.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:106
msgid "PCP-Decode"
msgstr "PCP-Decode"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:108
msgid "**Chunked Prefill:**"
msgstr "**分块预填充:**"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:110
msgid ""
"Currently, there are three viable approaches for Chunked Prefill "
"compatibility: **AllGatherQ**, **AllGatherKV**, and **Ring-Attn**. Since "
"PCP performs sequence sharding on both the query sequence and the KV "
"cache, we need to ensure that one side has complete information or employ"
" a method like Ring-Attn to perform computations sequentially. The "
"advantages and disadvantages of Ring-Attn will not be elaborated here."
msgstr "目前,有三种可行的分块预填充兼容性方法:**AllGatherQ**、**AllGatherKV** 和 **Ring-Attn**。由于 PCP 对查询序列和 KV 缓存都执行序列分片,我们需要确保其中一方拥有完整信息,或者采用类似 Ring-Attn 的方法顺序执行计算。Ring-Attn 的优缺点在此不赘述。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:114
msgid ""
"We have implemented the **AllGatherQ** approach in the GQA attention "
"backend and the **AllGatherKV** approach in the MLA attention backend. "
"The workflow after **AllGatherQ** is identical to the decode phase, while"
" the workflow after **AllGatherKV** is the same as the standard prefill "
"phase. For details, please refer to the diagram below; specific steps "
"will not be repeated."
msgstr "我们已在 GQA 注意力后端实现了 **AllGatherQ** 方法,并在 MLA 注意力后端实现了 **AllGatherKV** 方法。**AllGatherQ** 之后的工作流与解码阶段相同,而 **AllGatherKV** 之后的工作流与标准预填充阶段相同。详情请参考下图;具体步骤不再赘述。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:118
msgid ""
"One important note: **AllGatherKV** may lead to significant peak memory "
"usage when the context length becomes excessively long. To mitigate this,"
" we adopt a segmented processing strategy. By predefining the maximum "
"amount of KV cache processed per round, we sequentially complete the "
"attention computation and online softmax updates for each segment."
msgstr ""
"一个重要注意事项:当上下文长度变得过长时,**AllGatherKV** 可能导致显著的峰值内存使用。为了缓解这个问题,我们采用了分段处理策略。通过预定义每轮处理的 KV 缓存最大量,我们依次完成每个分段的注意力计算和在线 softmax 更新。"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:122
msgid "![PCP-ChunkedPrefill](../../assets/cp/chunkedprefill.png)"
msgstr "![PCP-ChunkedPrefill](../../assets/cp/chunkedprefill.png)"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:122
msgid "PCP-ChunkedPrefill"
msgstr "PCP-ChunkedPrefill"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:124
msgid "Related Files"
msgstr "相关文件"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:126
msgid "slot_mapping computation: `vllm_ascend/worker/block_table.py`"
msgstr "slot_mapping 计算:`vllm_ascend/worker/block_table.py`"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:127
msgid ""
"sequences splitting and metadata prepare: "
"`vllm_ascend/worker/model_runner_v1.py`"
msgstr "序列拆分与元数据准备:`vllm_ascend/worker/model_runner_v1.py`"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:128
msgid "GQA backend: `vllm_ascend/attention/attention_cp.py`"
msgstr "GQA 后端:`vllm_ascend/attention/attention_cp.py`"
#: ../../source/developer_guide/Design_Documents/context_parallel.md:129
msgid "MLA backend: `vllm_ascend/attention/mla_cp.py`"
msgstr "MLA 后端:`vllm_ascend/attention/mla_cp.py`"

View File

@@ -1,826 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:1
msgid "CPU Binding"
msgstr "CPU 绑定"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:3
msgid "Overview"
msgstr "概述"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:5
msgid ""
"CPU binding pins vLLM Ascend worker processes and key threads to specific"
" CPU cores to reduce CPUNPU crossNUMA traffic and stabilize latency "
"under multiprocess workloads. It is designed for ARM servers running "
"Ascend NPUs and is automatically executed during worker initialization "
"when enabled."
msgstr ""
"CPU 绑定将 vLLM Ascend 工作进程和关键线程固定到特定的 CPU 核心,以减少 CPU-NPU 跨 NUMA "
"流量,并在多进程工作负载下稳定延迟。它专为运行 Ascend NPU 的 ARM 服务器设计,启用后会在工作进程初始化期间自动执行。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:7
msgid "Background"
msgstr "背景"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:9
msgid ""
"On multisocket ARM systems, the OS scheduler may place vLLM threads on "
"CPUs far from the local NPU, causing NUMA crosstraffic and jitter. CPU "
"binding enforces a deterministic CPU placement strategy and optionally "
"binds NPU IRQs to the same CPU pool. This is distinct from other "
"performance features (e.g., graph mode or dynamic batch) because it is "
"purely a hostside affinity policy and does not change model execution "
"logic."
msgstr ""
"在多插槽 ARM 系统上,操作系统调度器可能会将 vLLM 线程放置在远离本地 NPU 的 CPU 上,从而导致 NUMA "
"跨域流量和延迟抖动。CPU 绑定强制执行一种确定性的 CPU 放置策略,并可选地将 NPU IRQ 绑定到同一个 CPU "
"池。这与其他性能特性(如图模式或动态批处理)不同,因为它纯粹是主机端的亲和性策略,不改变模型执行逻辑。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:11
msgid "Design & How it works"
msgstr "设计与工作原理"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:13
msgid "Key concepts"
msgstr "关键概念"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:15
msgid ""
"**Allowed CPU list**: The cpuset from /proc/self/status "
"(Cpus_allowed_list). All allocations are constrained to this list."
msgstr ""
"**允许的 CPU 列表**:来自 /proc/self/status (Cpus_allowed_list) 的 "
"cpuset。所有分配都受限于此列表。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:16
msgid ""
"**Running NPU list**: Logical NPU IDs extracted from npusmi process "
"listing, optionally filtered by ASCEND_RT_VISIBLE_DEVICES."
msgstr ""
"**运行中的 NPU 列表**:从 npu-smi 进程列表中提取的逻辑 NPU ID可选地由 "
"ASCEND_RT_VISIBLE_DEVICES 过滤。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:17
msgid ""
"**CPU pool per NPU**: The CPU list assigned to each logical NPU ID based "
"on the binding mode."
msgstr "**每个 NPU 的 CPU 池**:根据绑定模式分配给每个逻辑 NPU ID 的 CPU 列表。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:18
msgid "**Binding modes & Device behavior**:"
msgstr "**绑定模式与设备行为**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Device type"
msgstr "设备类型"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Default mode"
msgstr "默认模式"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Description"
msgstr "描述"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "A3 (No Affinity)"
msgstr "A3 (无亲和性)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`global_slice`"
msgstr "`global_slice`"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid ""
"Splits the allowed CPU list evenly based on the **total number of global "
"logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU"
" cores. This prevents CPU core overlap across multiple process groups."
msgstr ""
"根据**全局逻辑 NPU 总数**均匀分割允许的 CPU 列表,确保每个 NPU 被分配一个连续的 CPU 核心段。这可以防止多个进程组之间的 "
"CPU 核心重叠。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "A2 / 310P / Others"
msgstr "A2 / 310P / 其他"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`topo_affinity`"
msgstr "`topo_affinity`"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid ""
"Allocates CPUs based on NPU topology affinity (`npusmi info -t topo`). "
"If multiple NPUs are assigned to a single NUMA node (which may cause "
"bandwidth contention), the CPU allocation extends to adjacent NUMA nodes."
msgstr ""
"基于 NPU 拓扑亲和性 (`npu-smi info -t topo`) 分配 CPU。如果多个 NPU 被分配到单个 NUMA "
"节点(可能导致带宽争用),则 CPU 分配会扩展到相邻的 NUMA 节点。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:25
msgid "**Default**: enabled (enable_cpu_binding = true)."
msgstr "**默认**:启用 (enable_cpu_binding = true)。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:26
msgid "**Fallback**: If NPU topo affinity is unavailable, global_slice is used."
msgstr "**回退**:如果 NPU 拓扑亲和性不可用,则使用 global_slice。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:27
msgid ""
"**Failure handling**: Any exception in binding is logged as a warning and"
" **binding is skipped for that rank**."
msgstr "**故障处理**:绑定过程中的任何异常都会记录为警告,并且**跳过该等级的绑定**。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:29
msgid "Execution flow (simplified)"
msgstr "执行流程(简化版)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:31
msgid ""
"**Feature entry**: worker initialization calls `bind_cpus(local_rank)` "
"when `enable_cpu_binding` is true."
msgstr ""
"**功能入口**:当 `enable_cpu_binding` 为 true 时,工作进程初始化会调用 "
"`bind_cpus(local_rank)`。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:32
msgid ""
"**CPU architecture gate**: If the CPU is not ARM, binding is skipped with"
" a log."
msgstr "**CPU 架构门控**:如果 CPU 不是 ARM则记录日志并跳过绑定。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:33
msgid "**Collect device info**:"
msgstr "**收集设备信息**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:34
msgid "Map logical NPU IDs from `npusmi info -m`."
msgstr "从 `npu-smi info -m` 映射逻辑 NPU ID。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:35
msgid "Detect running NPU IDs from npusmi info process table."
msgstr "从 npu-smi info 进程表中检测运行中的 NPU ID。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:36
msgid "Read cpuset from /proc/self/status."
msgstr "从 /proc/self/status 读取 cpuset。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:37
msgid "Read topo affinity from `npusmi info -t topo`."
msgstr "从 `npu-smi info -t topo` 读取拓扑亲和性。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:38
msgid "**Build CPU pools**:"
msgstr "**构建 CPU 池**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:39
msgid ""
"Use **global_slice** for A3 devices; **topo_affinity** for A2 and Atlas "
"300 inference products."
msgstr "对 A3 设备使用 **global_slice**;对 A2 和 Atlas 300 推理产品使用 **topo_affinity**。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:40
msgid "If topo affinity is missing, fall back to global_slice."
msgstr "如果缺少拓扑亲和性,则回退到 global_slice。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:41
msgid "Ensure each NPU has at least 5 CPUs."
msgstr "确保每个 NPU 至少有 5 个 CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:42
msgid "**Allocate perrole CPUs**:"
msgstr "**分配按角色划分的 CPU**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:43
msgid "Reserve the first two CPUs for IRQ binding."
msgstr "保留前两个 CPU 用于 IRQ 绑定。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:44
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:62
msgid "`main`: pool[2:-2]"
msgstr "`main`: pool[2:-2]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:45
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:63
msgid "`acl`: pool[-2]"
msgstr "`acl`: pool[-2]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:46
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:64
msgid "`release`: pool[-1]"
msgstr "`release`: pool[-1]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:47
msgid "**Bind threads**:"
msgstr "**绑定线程**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:48
msgid "Main process is pinned to `main` CPUs."
msgstr "主进程被固定到 `main` CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:49
msgid "ACL threads (named with acl_thread) are pinned to `acl` CPU."
msgstr "ACL 线程(以 acl_thread 命名)被固定到 `acl` CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:50
msgid "Release threads (named with release_thread) are pinned to `release` CPU."
msgstr "释放线程(以 release_thread 命名)被固定到 `release` CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:51
msgid "**Bind NPU IRQs (optional)**:"
msgstr "**绑定 NPU IRQ可选**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:52
msgid ""
"If /proc/irq is writable, bind SQ/CQ IRQs to the first two CPUs in the "
"pool."
msgstr "如果 /proc/irq 可写,则将 SQ/CQ IRQ 绑定到池中的前两个 CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:53
msgid "irqbalance may be stopped to prevent overrides."
msgstr "可能会停止 irqbalance 以防止覆盖。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:54
msgid "**Memory binding (optional)**:"
msgstr "**内存绑定(可选)**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:55
msgid ""
"If migratepages is available, memory for ACL threads is migrated to the "
"NPUs NUMA node."
msgstr "如果 migratepages 可用,则将 ACL 线程的内存迁移到 NPU 的 NUMA 节点。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:57
msgid "Allocation plan examples"
msgstr "分配方案示例"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:59
msgid ""
"The allocation plan is derived directly from the CPU pool per NPU and "
"then split into roles:"
msgstr "分配方案直接来源于每个 NPU 的 CPU 池,然后按角色划分:"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:61
msgid "IRQ CPUs: pool[0], pool[1]"
msgstr "IRQ CPU: pool[0], pool[1]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:66
msgid "Below are concrete examples that reflect the actual code paths."
msgstr "以下是反映实际代码路径的具体示例。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:68
msgid "Example 1: A3 inference server with 640 CPUs and 16 NPUs"
msgstr "示例 1具有 640 个 CPU 和 16 个 NPU 的 A3 推理服务器"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:70
msgid "allowed_cpus = [0..639] (640 CPUs)"
msgstr "allowed_cpus = [0..639] (640 个 CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:71
msgid "NUMA nodes = 0..7 (8 NUMA nodes, symmetric layout)"
msgstr "NUMA 节点 = 0..7 (8 个 NUMA 节点,对称布局)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:72
msgid "total_npus = 16"
msgstr "total_npus = 16"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:73
msgid "running_npu_list = [0..15]"
msgstr "running_npu_list = [0..15]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:74
msgid "base = 640 // 16 = 40, extra = 0"
msgstr "base = 640 // 16 = 40, extra = 0"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:75
msgid "Each NPU gets a 40CPU pool."
msgstr "每个 NPU 获得一个 40 个 CPU 的池。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "NPU ID"
msgstr "NPU ID"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Assigned CPU Cores (global_slice)"
msgstr "分配的 CPU 核心 (global_slice)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Role Division (IRQ/Main/ACL/Release)"
msgstr "角色划分 (IRQ/Main/ACL/Release)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "0"
msgstr "0"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "0-39"
msgstr "0-39"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 0-1, `Main`: 2-37, `ACL`: 38, `Release`: 39"
msgstr "`IRQ`: 0-1, `Main`: 2-37, `ACL`: 38, `Release`: 39"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "1"
msgstr "1"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "40-79"
msgstr "40-79"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 40-41, `Main`: 42-77, `ACL`: 78, `Release`: 79"
msgstr "`IRQ`: 40-41, `Main`: 42-77, `ACL`: 78, `Release`: 79"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "..."
msgstr "..."
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "15"
msgstr "15"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "600-639"
msgstr "600-639"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 600-601, `Main`: 602-637, `ACL`: 638, `Release`: 639"
msgstr "`IRQ`: 600-601, `Main`: 602-637, `ACL`: 638, `Release`: 639"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:84
msgid ""
"This layout remains deterministic even when multiple processes share the "
"same cpuset, because slicing is based on the global logical NPU ID."
msgstr "即使多个进程共享同一个 cpuset此布局也保持确定性因为切片是基于全局逻辑 NPU ID 的。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:86
msgid "Example 2: A3 global_slice, even split"
msgstr "示例 2A3 global_slice均匀分割"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:88
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:109
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:142
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:161
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:182
msgid "**Inputs**:"
msgstr "**输入**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:90
msgid "allowed_cpus = [0..23] (24 CPUs)"
msgstr "allowed_cpus = [0..23] (24个CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:91
msgid ""
"NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..11, NUMA1 ="
" 12..23)"
msgstr "NUMA 节点 = 0..1 (2个NUMA节点对称布局NUMA0 = 0..11, NUMA1 = 12..23)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:92
msgid "total_npus = 4 (from npu-smi info -m)"
msgstr "total_npus = 4 (来自 npu-smi info -m)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:93
msgid "running_npu_list = [0, 1, 2, 3]"
msgstr "running_npu_list = [0, 1, 2, 3]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:95
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:116
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:149
msgid "**Global slice**:"
msgstr "**全局切片**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:97
msgid "base = 24 // 4 = 6, extra = 0"
msgstr "base = 24 // 4 = 6, extra = 0"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:98
msgid "Each NPU gets a 6CPU pool."
msgstr "每个NPU获得一个包含6个CPU的池。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "0-5"
msgstr "0-5"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5"
msgstr "`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "6-11"
msgstr "6-11"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11"
msgstr "`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "2"
msgstr "2"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "12-17"
msgstr "12-17"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 12-13, `Main`: 14-15, `ACL`: 16, `Release`: 17"
msgstr "`IRQ`: 12-13, `Main`: 14-15, `ACL`: 16, `Release`: 17"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "3"
msgstr "3"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "18-23"
msgstr "18-23"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 18-19, `Main`: 20-21, `ACL`: 22, `Release`: 23"
msgstr "`IRQ`: 18-19, `Main`: 20-21, `ACL`: 22, `Release`: 23"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:107
msgid "Example 3: A3 global_slice, remainder distribution"
msgstr "示例 3: A3 global_slice余数分配"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:111
msgid "allowed_cpus = [0..16] (17 CPUs)"
msgstr "allowed_cpus = [0..16] (17个CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:112
msgid ""
"NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..7, NUMA1 = "
"8..16)"
msgstr "NUMA 节点 = 0..1 (2个NUMA节点对称布局NUMA0 = 0..7, NUMA1 = 8..16)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:113
msgid "total_npus = 3"
msgstr "total_npus = 3"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:114
msgid "running_npu_list = [0, 1, 2]"
msgstr "running_npu_list = [0, 1, 2]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:118
msgid "base = 17 // 3 = 5, extra = 2"
msgstr "base = 17 // 3 = 5, extra = 2"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:119
msgid "NPU0 pool size = 6 (base+1)"
msgstr "NPU0 池大小 = 6 (base+1)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:120
msgid "NPU1 pool size = 6 (base+1)"
msgstr "NPU1 池大小 = 6 (base+1)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:121
msgid "NPU2 pool size = 5 (base)"
msgstr "NPU2 池大小 = 5 (base)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "12-16"
msgstr "12-16"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 12-13, `Main`: 14, `ACL`: 15, `Release`: 16"
msgstr "`IRQ`: 12-13, `Main`: 14, `ACL`: 15, `Release`: 16"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:129
msgid ""
"Note: When a pool size is exactly 5, `main` has a single CPU (pool[2]). "
"If any pool is <5, binding raises an error."
msgstr "注意当池大小恰好为5时`main` 只有一个CPU (pool[2])。如果任何池小于5绑定将引发错误。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:131
msgid "**NUMA analysis**:"
msgstr "**NUMA 分析**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:133
msgid ""
"With the symmetric NUMA layout above (NUMA0 = 0..7, NUMA1 = 8..16), NPU0 "
"stays within NUMA0, NPU2 stays within NUMA1, but NPU1 spans both NUMA0 "
"(6,7) and NUMA1 (8..11). This is a direct consequence of global slicing "
"over the ordered cpuset; the remainder distribution does not enforce NUMA"
" boundaries."
msgstr ""
"在上述对称NUMA布局中 (NUMA0 = 0..7, NUMA1 = "
"8..16)NPU0保持在NUMA0内NPU2保持在NUMA1内但NPU1跨越了NUMA0 (6,7) 和 NUMA1 "
"(8..11)。这是对有序cpuset进行全局切片的直接结果余数分配不强制NUMA边界。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:134
msgid ""
"If the cpuset numbering is interleaved across NUMA nodes (nonsymmetric "
"layout), crossNUMA pools can happen even earlier. This is why symmetric "
"NUMA layout is recommended for best locality."
msgstr "如果cpuset编号在NUMA节点间交错非对称布局跨NUMA池可能更早发生。这就是为什么推荐对称NUMA布局以获得最佳局部性。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:136
msgid "Known limitations and future improvements"
msgstr "已知限制与未来改进"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:138
msgid ""
"With the current `global_slice` strategy, some CPU/NPU layouts cannot "
"avoid crossNUMA pools. A future enhancement should incorporate NUMA node"
" boundaries into the slicing logic so that pools remain within a single "
"NUMA node whenever possible."
msgstr ""
"使用当前的 `global_slice` "
"策略某些CPU/NPU布局无法避免跨NUMA池。未来的增强应将NUMA节点边界纳入切片逻辑以便池尽可能保持在单个NUMA节点内。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:140
msgid "Example 4: global_slice with visible subset of NPUs"
msgstr "示例 4: 使用NPU可见子集的 global_slice"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:144
msgid "total_npus = 8 (from npu-smi info -m)"
msgstr "total_npus = 8 (来自 npu-smi info -m)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:145
msgid "running_npu_list = [2, 3] (filtered by ASCEND_RT_VISIBLE_DEVICES)"
msgstr "running_npu_list = [2, 3] (由 ASCEND_RT_VISIBLE_DEVICES 过滤)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:146
msgid "allowed_cpus = [0..39] (40 CPUs)"
msgstr "allowed_cpus = [0..39] (40个CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:147
msgid ""
"NUMA nodes = 0..3 (4 NUMA nodes, symmetric layout; 0..9, 10..19, 20..29, "
"30..39)"
msgstr "NUMA 节点 = 0..3 (4个NUMA节点对称布局0..9, 10..19, 20..29, 30..39)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:151
msgid "base = 40 // 8 = 5, extra = 0"
msgstr "base = 40 // 8 = 5, extra = 0"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:152
msgid ""
"Only the visible logical NPUs get pools, but slicing uses the global NPU "
"ID so different processes do not overlap."
msgstr "只有可见的逻辑NPU获得池但切片使用全局NPU ID因此不同进程不会重叠。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "10-14"
msgstr "10-14"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 10-11, `Main`: 12, `ACL`: 13, `Release`: 14"
msgstr "`IRQ`: 10-11, `Main`: 12, `ACL`: 13, `Release`: 14"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "15-19"
msgstr "15-19"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 15-16, `Main`: 17, `ACL`: 18, `Release`: 19"
msgstr "`IRQ`: 15-16, `Main`: 17, `ACL`: 18, `Release`: 19"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:159
msgid "Example 5: A2/310P topo_affinity with NUMA extension"
msgstr "示例 5: 具有NUMA扩展的 A2/310P topo_affinity"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:163
#, python-brace-format
msgid "npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)"
msgstr "npu_affinity = {0: [0..7], 1: [0..7]} (来自 `npu-smi info -t topo`)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:164
msgid "allowed_cpus = [0..15] (16 CPUs)"
msgstr "allowed_cpus = [0..15] (16个CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:165
msgid "NUMA nodes = 0..1 (2 NUMA nodes; NUMA0 = 0..7, NUMA1 = 8..15)"
msgstr "NUMA 节点 = 0..1 (2个NUMA节点NUMA0 = 0..7, NUMA1 = 8..15)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:167
msgid "**NUMA extension**:"
msgstr "**NUMA 扩展**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:169
msgid ""
"Both NPUs are on NUMA0, so each pool extends to the nearest NUMA node to "
"reduce contention."
msgstr "两个NPU都在NUMA0上因此每个池扩展到最近的NUMA节点以减少争用。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:170
msgid "NPU0 extends to NUMA1 -> [0..15]"
msgstr "NPU0 扩展到 NUMA1 -> [0..15]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:171
msgid "NPU1 extends to NUMA1 -> [0..15]"
msgstr "NPU1 扩展到 NUMA1 -> [0..15]"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:173
msgid ""
"Because both pools are identical, the allocator applies average "
"distribution across NPUs to avoid overlap. With a pool [0..15] and 2 "
"NPUs, the final pools become:"
msgstr "由于两个池相同分配器应用跨NPU的平均分配以避免重叠。对于池 [0..15] 和 2个NPU最终池变为"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Assigned CPU Cores (topo_affinity)"
msgstr "分配的CPU核心 (topo_affinity)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "0-7"
msgstr "0-7"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 0-1, `Main`: 2-5, `ACL`: 6, `Release`: 7"
msgstr "`IRQ`: 0-1, `Main`: 2-5, `ACL`: 6, `Release`: 7"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "8-15"
msgstr "8-15"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "`IRQ`: 8-9, `Main`: 10-13, `ACL`: 14, `Release`: 15"
msgstr "`IRQ`: 8-9, `Main`: 10-13, `ACL`: 14, `Release`: 15"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:180
msgid "Example 6: Minimum CPUs per NPU"
msgstr "示例 6: 每个NPU的最小CPU数"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:184
msgid "total_npus = 2"
msgstr "total_npus = 2"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:185
msgid "allowed_cpus = [0..7] (8 CPUs)"
msgstr "allowed_cpus = [0..7] (8个CPU)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:186
msgid ""
"NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..3, NUMA1 = "
"4..7)"
msgstr "NUMA 节点 = 0..1 (2个NUMA节点对称布局NUMA0 = 0..3, NUMA1 = 4..7)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:188
msgid "**Result**:"
msgstr "**结果**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:190
msgid ""
"base = 4, which is < 5, so binding fails with: \"Insufficient CPUs for "
"binding with IRQ/ACL/REL reservations...\""
msgstr "base = 4小于5因此绑定失败错误信息为\"用于IRQ/ACL/REL预留绑定的CPU不足...\""
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Assigned CPU Cores"
msgstr "分配的CPU核心"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "N/A"
msgstr "不适用"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
msgid "Binding error (insufficient CPUs per NPU)"
msgstr "绑定错误每个NPU的CPU不足"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:197
msgid ""
"To resolve, either reduce total_npus or enlarge the cpuset so that each "
"NPU has at least 5 CPUs."
msgstr "要解决此问题,要么减少 total_npus要么扩大 cpuset使每个NPU至少有5个CPU。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:199
msgid "Logging and verification"
msgstr "日志记录与验证"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:201
msgid "Logs show the selected binding mode and the allocation plan, for example:"
msgstr "日志显示选定的绑定模式和分配计划,例如:"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:202
msgid "`[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`"
msgstr "`[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:203
msgid "`The CPU allocation plan is as follows: ...`"
msgstr "`CPU分配计划如下...`"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:204
msgid "You can verify affinity via taskset or `/proc/<pid>/status` after startup."
msgstr "启动后,您可以通过 taskset 或 `/proc/<pid>/status` 验证亲和性。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:206
msgid "Limitations & Notes"
msgstr "限制与注意事项"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:208
msgid "**ARMonly**: Binding is skipped on nonARM CPUs."
msgstr "**仅限ARM**在非ARM CPU上跳过绑定。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:209
msgid ""
"**Minimum CPU requirement**: Each logical NPU requires at least 5 CPUs. "
"If the cpuset is smaller, binding fails with an error."
msgstr "**最小CPU要求**每个逻辑NPU至少需要5个CPU。如果cpuset更小绑定将失败并报错。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:210
msgid ""
"**NUMA symmetry assumption**: For best locality, the current strategies "
"assume the cpuset is evenly distributed across NUMA nodes and CPU "
"numbering aligns with NUMA layout; otherwise NUMA locality may be "
"suboptimal."
msgstr "**NUMA对称性假设**为获得最佳局部性当前策略假设cpuset在NUMA节点间均匀分布且CPU编号与NUMA布局对齐否则NUMA局部性可能不理想。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:211
msgid ""
"Example (symmetric layout): 2 NUMA nodes, 64 CPUs total. NUMA0 = CPUs "
"031, NUMA1 = CPUs 3263, and the cpuset is 063. With 4 logical NPUs, "
"global slicing yields 16 CPUs per NPU (015, 1631, 3247, 4863), so "
"each NPUs pool stays within a single NUMA node."
msgstr ""
"示例对称布局2个NUMA节点共64个CPU。NUMA0 = CPU 031NUMA1 = CPU "
"3263cpuset为063。对于4个逻辑NPU全局切片为每个NPU分配16个CPU (015, 1631, 3247, "
"4863)因此每个NPU的CPU池都保持在单个NUMA节点内。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:212
msgid "**Runtime dependencies**:"
msgstr "**运行时依赖项**"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:213
msgid "Requires npusmi and lscpu commands."
msgstr "需要 npusmi 和 lscpu 命令。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:214
msgid "IRQ binding requires write access to /proc/irq."
msgstr "IRQ绑定需要对 /proc/irq 的写访问权限。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:215
msgid "Memory binding requires migratepages; otherwise it is skipped."
msgstr "内存绑定需要 migratepages否则将跳过此步骤。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:216
msgid ""
"**IRQ side effects**: irqbalance may be stopped to avoid overriding "
"bindings."
msgstr "**IRQ副作用**:可能会停止 irqbalance 服务以避免覆盖绑定。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:217
msgid ""
"**Perprocess behavior**: Only the current ranks NPU is used for IRQ "
"binding to avoid crossprocess overwrite."
msgstr "**每进程行为**:仅使用当前 rank 的 NPU 进行 IRQ 绑定,以避免跨进程覆盖。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:219
msgid "Debug logging"
msgstr "调试日志"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:221
msgid ""
"Use the standard vLLM logging configuration to enable debug logs. The "
"binding process emits debug messages (e.g., `[cpu_global_slice] ...`) "
"when debug level is enabled."
msgstr "使用标准的 vLLM 日志配置来启用调试日志。当启用调试级别时,绑定过程会发出调试消息(例如 `[cpu_global_slice] ...`)。"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:223
msgid "References"
msgstr "参考资料"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:225
msgid ""
"CPU binding implementation: vllm_ascend/cpu_binding.py (`DeviceInfo`, "
"`CpuAlloc`, `bind_cpus`)"
msgstr ""
"CPU 绑定实现vllm_ascend/cpu_binding.py (`DeviceInfo`, `CpuAlloc`, "
"`bind_cpus`)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:226
msgid ""
"Worker integration: vllm_ascend/worker/worker.py "
"(`NPUWorker._init_device`)"
msgstr "Worker 集成vllm_ascend/worker/worker.py (`NPUWorker._init_device`)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:227
msgid ""
"Additional config option: "
"docs/source/user_guide/configuration/additional_config.md "
"(`enable_cpu_binding`)"
msgstr ""
"附加配置选项docs/source/user_guide/configuration/additional_config.md "
"(`enable_cpu_binding`)"
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:228
msgid "Tests: tests/ut/device_allocator/test_cpu_binding.py"
msgstr "测试tests/ut/device_allocator/test_cpu_binding.py"

View File

@@ -1,378 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:1
msgid "Disaggregated-prefill"
msgstr "解耦式预填充"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:3
msgid "Why disaggregated-prefill?"
msgstr "为何需要解耦式预填充?"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:5
msgid ""
"This feature addresses the need to optimize the **Time Per Output Token "
"(TPOT)** and **Time To First Token (TTFT)** in large-scale inference "
"tasks. The motivation is two-fold:"
msgstr "此功能旨在优化大规模推理任务中的**单输出令牌时间 (TPOT)** 和**首令牌时间 (TTFT)**。其动机主要有两方面:"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:7
msgid ""
"**Adjusting Parallel Strategy and Instance Count for P and D Nodes** "
"Using the disaggregated-prefill strategy, this feature allows the system "
"to flexibly adjust the parallelization strategy (e.g., data parallelism "
"(dp), tensor parallelism (tp), and expert parallelism (ep)) and the "
"instance count for both P (Prefiller) and D (Decoder) nodes. This leads "
"to better system performance tuning, particularly for **TTFT** and "
"**TPOT**."
msgstr ""
"**调整 P 节点和 D 节点的并行策略与实例数量** 采用解耦式预填充策略,此功能允许系统灵活调整 P预填充器节点和 "
"D解码器节点的并行化策略例如数据并行 (dp)、张量并行 (tp) 和专家并行 "
"(ep))以及实例数量。这有助于实现更好的系统性能调优,特别是针对 **TTFT** 和 **TPOT**。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:10
msgid ""
"**Optimizing TPOT** Without the disaggregated-prefill strategy, prefill "
"tasks are inserted during decoding, which results in inefficiencies and "
"delays. Disaggregated-prefill solves this by allowing for better control "
"over the system's **TPOT**. By managing chunked prefill tasks "
"effectively, the system avoids the challenge of determining the optimal "
"chunk size and provides more reliable control over the time taken for "
"generating output tokens."
msgstr ""
"**优化 TPOT** 在没有解耦式预填充策略的情况下,预填充任务会在解码过程中插入,导致效率低下和延迟。解耦式预填充通过允许更好地控制系统 "
"**TPOT** 来解决此问题。通过有效管理分块的预填充任务,系统避免了确定最佳分块大小的挑战,并对生成输出令牌所需时间提供了更可靠的控制。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:15
msgid "Usage"
msgstr "使用方法"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:17
msgid ""
"vLLM Ascend currently supports two types of connectors for handling KV "
"cache management:"
msgstr "vLLM Ascend 目前支持两种用于处理 KV 缓存管理的连接器:"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:19
msgid "**MooncakeConnector**: D nodes pull KV cache from P nodes."
msgstr "**MooncakeConnector**D 节点从 P 节点拉取 KV 缓存。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:20
msgid ""
"**MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a "
"layered manner."
msgstr "**MooncakeLayerwiseConnector**P 节点以分层方式将 KV 缓存推送到 D 节点。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:22
msgid ""
"For step-by-step deployment and configuration, refer to the following "
"guide: "
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
msgstr ""
"有关分步部署和配置,请参考以下指南: "
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:27
msgid "How It Works"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:29
msgid "1. Design Approach"
msgstr "1.设计思路"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:31
msgid ""
"Under the disaggregated-prefill, a global proxy receives external "
"requests, forwarding prefill to P nodes and decode to D nodes; the KV "
"cache (key-value cache) is exchanged between P and D nodes via peer-to-"
"peer (P2P) communication."
msgstr ""
"在解耦式预填充架构下,一个全局代理接收外部请求,将预填充请求转发给 P 节点,将解码请求转发给 D 节点KV 缓存(键值缓存)通过点对点 "
"(P2P) 通信在 P 节点和 D 节点之间交换。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:33
msgid "2. Implementation Design"
msgstr "2.实现设计"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:35
msgid ""
"Our design diagram is shown below, illustrating the pull and push schemes"
" respectively. ![alt text](../../assets/disaggregated_prefill_pull.png) "
"![alt text](../../assets/disaggregated_prefill_push.png)"
msgstr ""
"我们的设计图如下所示,分别展示了拉取和推送方案。![alt "
"text](../../assets/disaggregated_prefill_pull.png) ![alt "
"text](../../assets/disaggregated_prefill_push.png)"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:35
msgid "alt text"
msgstr "替代文本"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:39
msgid "Mooncake Connector"
msgstr "Mooncake 连接器"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:41
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:49
msgid "The request is sent to the Proxy's `_handle_completions` endpoint."
msgstr "请求被发送到代理的 `_handle_completions` 端点。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:42
msgid ""
"The Proxy calls `select_prefiller` to choose a P node and forwards the "
"request, configuring `kv_transfer_params` with `do_remote_decode=True`, "
"`max_completion_tokens=1`, and `min_tokens=1`."
msgstr ""
"代理调用 `select_prefiller` 选择一个 P 节点并转发请求,配置 `kv_transfer_params` 为 "
"`do_remote_decode=True`、`max_completion_tokens=1` 和 `min_tokens=1`。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:43
msgid ""
"After the P node's scheduler finishes prefill, `update_from_output` "
"invokes the schedule connector's `request_finished` to defer KV cache "
"release, constructs `kv_transfer_params` with `do_remote_prefill=True`, "
"and returns to the Proxy."
msgstr ""
"P 节点的调度器完成预填充后,`update_from_output` 调用调度连接器的 `request_finished` 以延迟释放 KV "
"缓存,构建 `kv_transfer_params` 为 `do_remote_prefill=True`,并返回给代理。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:44
msgid ""
"The Proxy calls `select_decoder` to choose a D node and forwards the "
"request."
msgstr "代理调用 `select_decoder` 选择一个 D 节点并转发请求。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:45
msgid ""
"On the D node, the scheduler marks the request as "
"`RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls "
"`kv_connector_no_forward` to pull the remote KV cache, then notifies the "
"P node to release KV cache and proceeds with decoding to return the "
"result."
msgstr ""
"在 D 节点上,调度器将请求标记为 `RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,调用 "
"`kv_connector_no_forward` 拉取远程 KV 缓存,然后通知 P 节点释放 KV 缓存并继续解码以返回结果。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:47
msgid "Mooncake Layerwise Connector"
msgstr "Mooncake 分层连接器"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:50
msgid ""
"The Proxy calls `select_decoder` to choose a D node and forwards the "
"request, configuring `kv_transfer_params` with `do_remote_prefill=True` "
"and setting the `metaserver` endpoint."
msgstr ""
"代理调用 `select_decoder` 选择一个 D 节点并转发请求,配置 `kv_transfer_params` 为 "
"`do_remote_prefill=True` 并设置 `metaserver` 端点。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:51
msgid ""
"On the D node, the scheduler uses `kv_transfer_params` to mark the "
"request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV "
"cache, then calls `kv_connector_no_forward` to send a request to the "
"metaserver and waits for the KV cache transfer to complete."
msgstr ""
"在 D 节点上,调度器使用 `kv_transfer_params` 将请求标记为 "
"`RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,然后调用 "
"`kv_connector_no_forward` 向元服务器发送请求并等待 KV 缓存传输完成。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:52
msgid ""
"The Proxy's `metaserver` endpoint receives the request, calls "
"`select_prefiller` to choose a P node, and forwards it with "
"`kv_transfer_params` set to `do_remote_decode=True`, "
"`max_completion_tokens=1`, and `min_tokens=1`."
msgstr ""
"代理的 `metaserver` 端点接收请求,调用 `select_prefiller` 选择一个 P 节点,并转发请求,设置 "
"`kv_transfer_params` 为 `do_remote_decode=True`、`max_completion_tokens=1` "
"和 `min_tokens=1`。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:53
msgid ""
"During processing, the P node's scheduler pushes KV cache layer-wise; "
"once all layers pushing is complete, it releases the request and notifies"
" the D node to begin decoding."
msgstr "在处理过程中P 节点的调度器逐层推送 KV 缓存;所有层推送完成后,它释放请求并通知 D 节点开始解码。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:54
msgid "The D node performs decoding and returns the result."
msgstr "D 节点执行解码并返回结果。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:56
msgid "3. Interface Design"
msgstr "3. 接口设计"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:58
msgid ""
"Taking MooncakeConnector as an example, the system is organized into "
"three primary classes:"
msgstr "以 MooncakeConnector 为例,系统被组织成三个主要类:"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:60
msgid "**MooncakeConnector**: Base class that provides core interfaces."
msgstr "**MooncakeConnector**:提供核心接口的基类。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:61
msgid ""
"**MooncakeConnectorScheduler**: Interface for scheduling the connectors "
"within the engine core, responsible for managing KV cache transfer "
"requirements and completion."
msgstr "**MooncakeConnectorScheduler**:用于在引擎核心内调度连接器的接口,负责管理 KV 缓存传输需求和完成情况。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:62
msgid ""
"**MooncakeConnectorWorker**: Interface for managing KV cache registration"
" and transfer in worker processes."
msgstr "**MooncakeConnectorWorker**:用于在工作进程中管理 KV 缓存注册和传输的接口。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:64
msgid "4. Specifications Design"
msgstr "4.规格设计"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:66
msgid ""
"This feature is flexible and supports various configurations, including "
"setups with MLA and GQA models. It is compatible with A2 and A3 hardware "
"configurations and facilitates scenarios involving equal TP setups and "
"certain unequal TP setups across multiple P and D nodes."
msgstr ""
"此功能灵活,支持多种配置,包括使用 MLA 和 GQA 模型的设置。它与 A2 和 A3 硬件配置兼容,并支持跨多个 P 节点和 D "
"节点的相等和不相等 TP 设置场景。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "Feature"
msgstr "功能"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "Status"
msgstr "状态"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "A2"
msgstr "A2"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "🟢 Functional"
msgstr "🟢 功能正常"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "A3"
msgstr "A3"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "equal TP configuration"
msgstr "相等 TP 配置"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "unequal TP configuration"
msgstr "不相等 TP 配置"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "MLA"
msgstr "MLA"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
msgid "GQA"
msgstr "GQA"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:77
msgid "🟢 Functional: Fully operational, with ongoing optimizations."
msgstr "🟢 功能正常:完全可运行,正在进行优化。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:78
msgid "🔵 Experimental: Experimental support, interfaces and functions may change."
msgstr "🔵 实验性:实验性支持,接口和功能可能发生变化。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:79
msgid "🚧 WIP: Under active development, will be supported soon."
msgstr "🚧 开发中:正在积极开发,即将支持。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:80
msgid ""
"🟡 Planned: Scheduled for future implementation (some may have open "
"PRs/RFCs)."
msgstr "🟡 计划中:计划在未来实现(部分可能已有开放的 PR/RFC。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:81
msgid "🔴 NO plan/Deprecated: No plan or deprecated by vLLM."
msgstr "🔴 无计划/已弃用:无计划或已被 vLLM 弃用。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:85
msgid "DFX Analysis"
msgstr "DFX 分析"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:87
msgid "1. Config Parameter Validation"
msgstr "1.配置参数验证"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:89
msgid ""
"Validate KV transfer config by checking whether the kv_connector type is "
"supported and whether kv_connector_module_path exists and is loadable. On"
" transfer failures, emit clear error logs for diagnostics."
msgstr ""
"通过检查 kv_connector 类型是否受支持以及 kv_connector_module_path 是否存在且可加载来验证 KV "
"传输配置。传输失败时,发出清晰的错误日志以供诊断。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:91
msgid "2. Port Conflict Detection"
msgstr "2.端口冲突检测"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:93
msgid ""
"Before startup, perform a port-usage check on configured ports (e.g., "
"rpc_port, metrics_port, http_port/metaserver) by attempting to bind. If a"
" port is already in use, fail fast and log an error."
msgstr ""
"启动前,通过尝试绑定来对配置的端口(例如 "
"rpc_port、metrics_port、http_port/metaserver进行端口使用情况检查。如果端口已被占用快速失败并记录错误。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:95
msgid "3. PD Ratio Validation"
msgstr "3.PD 比例验证"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:97
msgid ""
"Under non-symmetric PD scenarios, validate the P-to-D tp ratio against "
"expected and scheduling constraints to ensure correct and reliable "
"operation."
msgstr "在非对称 PD 场景下,根据预期和调度约束验证 P 到 D 的 tp 比例,以确保正确可靠的操作。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:101
msgid "Limitations"
msgstr "限制"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:103
msgid ""
"Heterogeneous P and D nodes are not supported—for example, running P "
"nodes on A2 and D nodes on A3."
msgstr "不支持异构的 P 节点和 D 节点——例如,在 A2 上运行 P 节点,在 A3 上运行 D 节点。"
#: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:105
msgid ""
"In non-symmetric TP configurations, only cases where the P nodes have a "
"higher TP degree than the D nodes and the P TP count is an integer "
"multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % "
"D_tp = 0)."
msgstr ""
"在非对称 TP 配置中,仅支持 P 节点的 TP 度数高于 D 节点且 P 节点的 TP 数量是 D 节点 TP 数量的整数倍的情况(即 P_tp"
" > D_tp 且 P_tp % D_tp = 0。"

View File

@@ -1,480 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:1
msgid "Expert Parallelism Load Balancer (EPLB)"
msgstr "专家并行负载均衡器 (EPLB)"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:3
msgid "Why We Need EPLB?"
msgstr "为什么需要 EPLB"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:5
msgid ""
"When using Expert Parallelism (EP), different experts are assigned to "
"different NPUs. Given that the load of various experts may vary depending"
" on the current workload, it is crucial to maintain balanced loads across"
" different NPUs. We adopt a redundant experts strategy by duplicating "
"heavily-loaded experts. Then, we heuristically pack these duplicated "
"experts onto NPUs to ensure load balancing across them. Moreover, thanks "
"to the group-limited expert routing used in MoE models, we also attempt "
"to place experts of the same group on the same node to reduce inter-node "
"data traffic, whenever possible."
msgstr ""
"在使用专家并行 (EP) 时,不同的专家被分配到不同的 NPU 上。鉴于不同专家的负载可能因当前工作负载而异,保持不同 NPU "
"之间的负载均衡至关重要。我们采用冗余专家策略,通过复制高负载的专家来实现。然后,我们启发式地将这些复制的专家打包到 NPU "
"上,以确保它们之间的负载均衡。此外,得益于 MoE "
"模型中使用的组限制专家路由,我们也尽可能将同一组的专家放置在同一节点上,以减少节点间的数据流量。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:7
msgid ""
"To facilitate reproduction and deployment, vLLM Ascend supports the "
"deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. "
"The algorithm computes a balanced expert replication and placement plan "
"based on the estimated expert loads. Note that the exact method for "
"predicting expert loads is outside the scope of this repository. A common"
" method is to use a moving average of historical statistics."
msgstr ""
"为了方便复现和部署vLLM Ascend 在 `vllm_ascend/eplb/core/policy` 中支持已部署的 EP "
"负载均衡算法。该算法根据估计的专家负载计算一个均衡的专家复制和放置计划。请注意,预测专家负载的具体方法不在本仓库的讨论范围内。一种常见的方法是使用历史统计数据的移动平均值。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:9
msgid "![eplb](../../assets/eplb.png)"
msgstr "![eplb](../../assets/eplb.png)"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:9
msgid "eplb"
msgstr "eplb"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:11
msgid "How to Use EPLB?"
msgstr "如何使用 EPLB"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:13
msgid ""
"Please refer to the EPLB section of the user guide for detailed "
"information: [How to Use "
"EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)"
msgstr ""
"请参阅用户指南中的 EPLB 部分以获取详细信息:[如何使用 "
"EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:15
msgid "How It Works?"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:17
msgid "**EPLB Module Architecture**"
msgstr "**EPLB 模块架构**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:40
msgid ""
"**1. Adaptor Module** *Handles registration and adaptation for "
"different MoE model types*"
msgstr "**1. 适配器模块** *处理不同 MoE 模型类型的注册和适配*"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:43
msgid ""
"`abstract_adaptor.py` Abstract base class defining unified registration"
" interfaces for EPLB adapters"
msgstr "`abstract_adaptor.py` 定义 EPLB 适配器统一注册接口的抽象基类"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:45
msgid ""
"`vllm_adaptor.py` Implementation supporting Qwen3-MoE and DeepSeek "
"models, standardizing parameter handling for policy algorithms"
msgstr "`vllm_adaptor.py` 支持 Qwen3-MoE 和 DeepSeek 模型的实现,标准化策略算法的参数处理"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:48
msgid ""
"**2. Core Module** *Implements core algorithms, updates, and "
"asynchronous processing*"
msgstr "**2. 核心模块** *实现核心算法、更新和异步处理*"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:51
msgid ""
"**Policy Submodule** *Load balancing algorithms with factory pattern "
"instantiation*"
msgstr "**策略子模块** *采用工厂模式实例化的负载均衡算法*"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:53
msgid ""
"`policy_abstract.py` Abstract class for load balancing strategy "
"interfaces"
msgstr "`policy_abstract.py` 负载均衡策略接口的抽象类"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:55
msgid ""
"`policy_default_eplb.py` Default implementation of open-source EPLB "
"paper algorithm"
msgstr "`policy_default_eplb.py` 开源 EPLB 论文算法的默认实现"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:57
msgid ""
"`policy_swift_balancer.py` Enhanced version optimizing expert swaps for"
" low-bandwidth devices (e.g., A2)"
msgstr "`policy_swift_balancer.py` 针对低带宽设备(例如 A2优化专家交换的增强版本"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:59
msgid ""
"`policy_flashlb.py` Threshold-based adjustment reducing operational "
"costs through layer-wise fluctuation detection"
msgstr "`policy_flashlb.py` 基于阈值的调整,通过逐层波动检测降低操作成本"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:61
msgid ""
"`policy_factory.py` Strategy factory for automatic algorithm "
"instantiation"
msgstr "`policy_factory.py` 用于自动算法实例化的策略工厂"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:64
msgid ""
"`eplb_device_transfer_loader.py` Manages expert table/weight "
"transmission and updates"
msgstr "`eplb_device_transfer_loader.py` 管理专家表/权重的传输和更新"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:66
msgid "`eplb_utils.py` Utilities for expert table initialization and mapping"
msgstr "`eplb_utils.py` 用于专家表初始化和映射的实用工具"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:68
msgid ""
"`eplb_worker.py` Asynchronous algorithm orchestration and result "
"processing"
msgstr "`eplb_worker.py` 异步算法编排和结果处理"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:71
msgid "**3. System Components**"
msgstr "**3. 系统组件**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:73
msgid ""
"`eplb_updator.py` Central coordinator for load balancing during "
"inference workflows"
msgstr "`eplb_updator.py` 推理工作流中负载均衡的中心协调器"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:75
msgid "`utils.py` General utilities for EPLB interface registration"
msgstr "`utils.py` EPLB 接口注册的通用实用工具"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:78
msgid "*Key Optimizations:*"
msgstr "*关键优化点:*"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:80
msgid "Maintained original structure while improving technical clarity"
msgstr "保持原始结构的同时提高了技术清晰度"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:81
msgid "Standardized terminology"
msgstr "标准化术语"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:82
msgid "Enhanced algorithm differentiation through concise descriptors"
msgstr "通过简洁的描述符增强了算法区分度"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:83
msgid "Improved scoping through hierarchical presentation"
msgstr "通过分层展示改进了范围界定"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:84
msgid "Preserved file/class relationships while optimizing readability"
msgstr "在优化可读性的同时保留了文件/类关系"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:86
msgid "Default Algorithm"
msgstr "默认算法"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:88
msgid "Hierarchical Load Balancing"
msgstr "分层负载均衡"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:90
msgid ""
"When the number of server nodes evenly divides the number of expert "
"groups, we use the hierarchical load balancing policy to leverage group-"
"limited expert routing. We first pack the expert groups onto nodes "
"evenly, ensuring balanced loads across different nodes. Then, we "
"replicate the experts within each node. Finally, we pack the replicated "
"experts onto individual NPUs to ensure load balancing across them. The "
"hierarchical load balancing policy can be used in the prefilling stage "
"with a smaller expert-parallel size."
msgstr ""
"当服务器节点数量能整除专家组数量时,我们使用分层负载均衡策略来利用组限制专家路由。我们首先将专家组均匀地打包到节点上,确保不同节点间的负载均衡。然后,我们在每个节点内复制专家。最后,我们将复制的专家打包到各个"
" NPU 上,以确保它们之间的负载均衡。分层负载均衡策略可以在预填充阶段使用,此时专家并行规模较小。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:92
msgid "Global Load Balancing"
msgstr "全局负载均衡"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:94
msgid ""
"In other cases, we use the global load balancing policy, which replicates"
" experts globally regardless of expert groups, and packs the replicated "
"experts onto individual NPUs. This policy can be adopted in the decoding "
"stage with a larger expert-parallel size."
msgstr ""
"在其他情况下,我们使用全局负载均衡策略,该策略不考虑专家组,而是在全局范围内复制专家,并将复制的专家打包到各个 NPU "
"上。此策略可以在解码阶段采用,此时专家并行规模较大。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:96
msgid "Add a New EPLB Policy"
msgstr "添加新的 EPLB 策略"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:98
msgid ""
"If you want to add a new eplb policy to vllm_ascend, you must follow "
"these steps:"
msgstr "如果你想向 vllm_ascend 添加一个新的 eplb 策略,必须遵循以下步骤:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:100
msgid ""
"Inherit the `EplbPolicy` abstract class of `policy_abstract.py` and "
"override the `rebalance_experts` interface, ensuring consistent input "
"parameters `current_expert_table`, `expert_workload` and return types "
"`newplacement`. For example:"
msgstr ""
"继承 `policy_abstract.py` 中的 `EplbPolicy` 抽象类,并重写 `rebalance_experts` "
"接口,确保输入参数 `current_expert_table`、`expert_workload` 和返回类型 `newplacement` "
"保持一致。例如:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:126
msgid ""
"To add a new EPLB algorithm, include the policy type and its "
"corresponding implementation class in the `PolicyFactory` of "
"`policy_factory.py`."
msgstr "要添加新的 EPLB 算法,请在 `policy_factory.py` 的 `PolicyFactory` 中包含策略类型及其对应的实现类。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:128
msgid "Add a New MoE Model"
msgstr "添加新的 MoE 模型"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:130
msgid "**Implementation Guide for Model Integration**"
msgstr "**模型集成实施指南**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:132
msgid "**Adapter File Modification**"
msgstr "**适配器文件修改**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:133
msgid "Inherit or modify `vllm_ascend/eplb/adaptor/vllm_adaptor.py`"
msgstr "继承或修改 `vllm_ascend/eplb/adaptor/vllm_adaptor.py`"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:134
msgid "Add processing logic for key parameters:"
msgstr "为关键参数添加处理逻辑:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:135
msgid "`num_dense_layers`"
msgstr "`num_dense_layers`"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:136
msgid "`global_expert_num`"
msgstr "`global_expert_num`"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:137
msgid "`num_roe_layers`"
msgstr "`num_roe_layers`"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:138
msgid "Ensure parameter synchronization in the `model_register` function."
msgstr "确保在 `model_register` 函数中进行参数同步。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:140
msgid "For example:"
msgstr "例如:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:142
msgid "Modify `__init__` of `vllm_adaptor.py` to add a new moe model eplb params:"
msgstr "修改 `vllm_adaptor.py` 的 `__init__` 以添加新 MoE 模型的 eplb 参数:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:150
msgid ""
"Modify `model_register` of `vllm_adaptor.py` to register eplb params for "
"new moe model:"
msgstr "修改 `vllm_adaptor.py` 的 `model_register` 以注册新 MoE 模型的 eplb 参数:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:157
msgid "**MoE Feature Integration**"
msgstr "**MoE 功能集成**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:158
msgid "Extend `vllm_ascend/eplb/utils.py` with MoE-specific methods"
msgstr "使用 MoE 特定方法扩展 `vllm_ascend/eplb/utils.py`"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:159
msgid "Implement required functionality for expert routing or weight management"
msgstr "实现专家路由或权重管理所需的功能"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:161
msgid "**Registration Logic Update**"
msgstr "**注册逻辑更新**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:162
msgid "Add patch logic within the `model_register` function"
msgstr "在 `model_register` 函数内添加补丁逻辑"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:163
msgid "Maintain backward compatibility with existing model types"
msgstr "保持与现有模型类型的向后兼容性"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:165
msgid "**Validation & Testing**"
msgstr "**验证与测试**"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:166
msgid "Verify parameter consistency across layers"
msgstr "验证跨层的参数一致性"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:167
msgid "Test cross-device communication for expert tables"
msgstr "测试专家表的跨设备通信"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:168
msgid "Benchmark against baseline implementations (e.g., Qwen3-MoE)"
msgstr "与基线实现(例如 Qwen3-MoE进行基准测试"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:170
msgid "*Key Implementation Notes:*"
msgstr "*关键实施说明:*"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:172
msgid "Preserve existing interface contracts in abstract classes"
msgstr "在抽象类中保留现有的接口契约"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:173
msgid "Use decorators for non-intrusive patch integration"
msgstr "使用装饰器进行非侵入式补丁集成"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:174
msgid "Leverage `eplb_utils.py` for shared expert mapping operations"
msgstr "利用 `eplb_utils.py` 进行共享的专家映射操作"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:176
msgid "DFX"
msgstr "DFX"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:178
msgid "Parameter Validation"
msgstr "参数验证"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:180
msgid "Integer Parameters"
msgstr "整数参数"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:182
msgid ""
"All integer input parameters must explicitly specify their maximum and "
"minimum values and be subject to valid value validation. For example, "
"`expert_heat_collection_interval` must be greater than 0:"
msgstr ""
"所有整型输入参数必须明确指定其最大值和最小值,并接受有效值验证。例如,`expert_heat_collection_interval` "
"必须大于0"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:197
msgid "File Path"
msgstr "文件路径"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:199
msgid ""
"The file path for EPLB must be checked for legality, such as whether the "
"file path is valid and whether it has appropriate read and write "
"permissions. For example:"
msgstr "必须检查 EPLB 文件路径的合法性,例如文件路径是否有效以及是否具有适当的读写权限。例如:"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:225
msgid "Function Specifications"
msgstr "功能规范"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:227
msgid "Initialization Function"
msgstr "初始化函数"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:229
msgid ""
"All EPLB parameters must be initialized by default during initialization,"
" with specified parameter types and default values for proper handling."
msgstr "所有 EPLB 参数在初始化期间必须默认初始化,并指定参数类型和默认值以便正确处理。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:231
msgid "General Functions"
msgstr "通用函数"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:233
msgid ""
"All method arguments must specify parameter types and default values, and"
" functions must include default return value handling for default "
"arguments. It is recommended to use `try-except` blocks to handle the "
"function body, specifying the type of exception captured and the failure "
"handling (e.g., logging exceptions or returning a failure status)."
msgstr ""
"所有方法参数必须指定参数类型和默认值,并且函数必须包含针对默认参数的默认返回值处理。建议使用 `try-except` "
"块来处理函数体,指定捕获的异常类型和失败处理(例如,记录异常或返回失败状态)。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:235
msgid "Consistency"
msgstr "一致性"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:236
msgid "Expert Map"
msgstr "专家映射"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:237
msgid ""
"The expert map must be globally unique during initialization and update. "
"In a multi-node scenario during initialization, distributed communication"
" should be used to verify the consistency of expert maps across each "
"rank. If they are inconsistent, the user should be notified of which "
"ranks have inconsistent maps. During the update process, if only a few "
"layers or the expert table of a certain rank has been changed, the "
"updated expert table must be synchronized with the EPLB's context to "
"ensure global consistency."
msgstr ""
"专家映射在初始化和更新期间必须是全局唯一的。在初始化期间的多节点场景中,应使用分布式通信来验证每个 rank "
"上专家映射的一致性。如果不一致,应通知用户哪些 rank 的映射不一致。在更新过程中,如果只有少数层或某个 rank "
"的专家表被更改,则必须将更新后的专家表与 EPLB 的上下文同步,以确保全局一致性。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:242
msgid "Expert Weight"
msgstr "专家权重"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:244
msgid ""
"When updating expert weights, ensure that the memory allocated for the "
"expert weights has been released, or that the expert (referring to the "
"old version) is no longer in use."
msgstr "更新专家权重时,确保为专家权重分配的内存已被释放,或者专家(指旧版本)不再被使用。"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:246
msgid "Limitations"
msgstr "限制"
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:248
msgid ""
"Before using EPLB, start the script and add `export "
"DYNAMIC_EPLB=\"true\"`. Before performing load data collection (or "
"performance data collection), start the script and add `export "
"EXPERT_MAP_RECORD=\"true\"`."
msgstr ""
"在使用 EPLB 之前,启动脚本并添加 `export "
"DYNAMIC_EPLB=\"true\"`。在执行负载数据收集(或性能数据收集)之前,启动脚本并添加 `export "
"EXPERT_MAP_RECORD=\"true\"`。"

View File

@@ -1,232 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:1
msgid "Npugraph_ex"
msgstr "Npugraph_ex"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:3
msgid "How Does It Work?"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:5
msgid ""
"This is an optimization based on FX graphs, which can be considered an "
"acceleration solution for the aclgraph mode."
msgstr "这是一种基于 FX 图的优化,可视为 aclgraph 模式的一种加速方案。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:7
msgid "You can get its code [code](https://gitcode.com/Ascend/torchair)"
msgstr "您可以在 [code](https://gitcode.com/Ascend/torchair) 获取其代码"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:9
msgid "Default FX Graph Optimization"
msgstr "默认 FX 图优化"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:11
msgid "FX Graph pass"
msgstr "FX 图处理过程"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:13
msgid ""
"For the intermediate nodes of the model, replace the non-in-place "
"operators contained in the nodes with in-place operators to reduce memory"
" movement during computation and improve performance."
msgstr "对于模型的中间节点,将其包含的非原位运算符替换为原位运算符,以减少计算过程中的内存移动,提升性能。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:14
msgid ""
"For the original input parameters of the model, if they include in-place "
"operators, Dynamo's Functionalize process will replace the in-place "
"operators with a form of non-in-place operators + copy operators. "
"npugraph_ex will reverse this process, restoring the in-place operators "
"and reducing memory movement."
msgstr ""
"对于模型的原始输入参数如果包含原位运算符Dynamo 的 Functionalize 过程会将其替换为非原位运算符 + "
"复制运算符的形式。npugraph_ex 将逆转此过程,恢复原位运算符,减少内存移动。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:16
msgid "FX fusion pass"
msgstr "FX 融合处理过程"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:18
msgid ""
"npugraph_ex now provides three default operator fusion passes, and more "
"will be added in the future."
msgstr "npugraph_ex 目前提供三种默认的算子融合处理过程,未来将添加更多。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:20
msgid ""
"Operator combinations that meet the replacement rules can be replaced "
"with the corresponding fused operators."
msgstr "符合替换规则的算子组合可以被替换为相应的融合算子。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:22
msgid ""
"You can get the default [fusion pass "
"list](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html)"
msgstr "您可以查看默认的[融合处理过程列表](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html)"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:24
msgid "Custom fusion pass"
msgstr "自定义融合处理过程"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:26
msgid ""
"Users can register a custom graph fusion pass in TorchAir to modify "
"PyTorch FX graphs. The registration relies on the register_replacement "
"API."
msgstr ""
"用户可以在 TorchAir 中注册自定义的图融合处理过程,以修改 PyTorch FX 图。注册依赖于 register_replacement"
" API。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:28
msgid "Below is the declaration of this API and a demo of its usage."
msgstr "以下是该 API 的声明及其使用示例。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Parameter Name"
msgstr "参数名称"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Input/Output"
msgstr "输入/输出"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Explanation"
msgstr "说明"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Is necessary"
msgstr "是否必需"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "search_fn"
msgstr "search_fn"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Input"
msgstr "输入"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"This function is the operator combination or calculation logic that you "
"want to recognize in the FX graph, such as the operator combination that "
"needs to be fused"
msgstr "此函数是您希望在 FX 图中识别的算子组合或计算逻辑,例如需要融合的算子组合"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "Yes"
msgstr "是"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "replace_fn"
msgstr "replace_fn"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"When the combination corresponding to search_fn is found in the target "
"graph, this function's computation logic will replace the original "
"subgraph to achieve operator fusion or optimization."
msgstr "当在目标图中找到与 search_fn 对应的组合时,此函数的计算逻辑将替换原子图,以实现算子融合或优化。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "example_inputs"
msgstr "example_inputs"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"Example input tensors used to track search_fn and replace_fn. The shape "
"and dtype of the input should match the actual scenario."
msgstr "用于追踪 search_fn 和 replace_fn 的示例输入张量。输入的形状和数据类型应与实际场景匹配。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "trace_fn"
msgstr "trace_fn"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"By default, only the forward computation graph is tracked, which is "
"suitable for optimization during the inference phase; if training "
"scenarios need to be supported, a function that supports backward "
"tracking can be provided."
msgstr "默认情况下,仅追踪前向计算图,这适用于推理阶段的优化;如果需要支持训练场景,可以提供支持反向追踪的函数。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "No"
msgstr "否"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "extra_check"
msgstr "extra_check"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"Find the extra verification function after operator fusion. The "
"function's input parameter must be a Match object from "
"torch._inductor.pattern_matcher, and it is used for further custom checks"
" on the matching result, such as checking whether the fused operators are"
" on the same stream, checking the device type, checking the input shapes,"
" and so on."
msgstr ""
"算子融合后的额外验证函数。该函数的输入参数必须是来自 torch._inductor.pattern_matcher 的 Match "
"对象,用于对匹配结果进行进一步的自定义检查,例如检查融合后的算子是否在同一流上、检查设备类型、检查输入形状等。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid "search_fn_pattern"
msgstr "search_fn_pattern"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
msgid ""
"A custom pattern object is generally unnecessary to provide. Its "
"definition follows the rules of the native PyTorch MultiOutputPattern "
"object. After passing this parameter, search_fn will no longer be used to"
" match operator combinations; instead, this parameter will be used "
"directly as the matching rule."
msgstr ""
"通常无需提供自定义模式对象。其定义遵循原生 PyTorch MultiOutputPattern 对象的规则。传入此参数后,将不再使用 "
"search_fn 来匹配算子组合,而是直接使用此参数作为匹配规则。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:43
msgid "Usage Example"
msgstr "使用示例"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:97
msgid ""
"The default fusion pass in npugraph_ex is also implemented based on this "
"API. You can see more examples of using this API in the vllm-ascend and "
"npugraph_ex code repositories."
msgstr ""
"npugraph_ex 中的默认融合处理过程也是基于此 API 实现的。您可以在 vllm-ascend 和 npugraph_ex "
"代码仓库中查看更多使用此 API 的示例。"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:99
msgid "DFX"
msgstr "DFX"
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:101
msgid ""
"By reusing the TORCH_COMPILE_DEBUG environment variable from the PyTorch "
"community, when TORCH_COMPILE_DEBUG=1 is set, it will output the FX "
"graphs throughout the entire process."
msgstr ""
"通过复用 PyTorch 社区的 TORCH_COMPILE_DEBUG 环境变量,当设置 TORCH_COMPILE_DEBUG=1 "
"时,将输出整个过程中的 FX 图。"

View File

@@ -1,225 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/patch.md:1
msgid "Patch in vLLM Ascend"
msgstr "vLLM Ascend 中的补丁"
#: ../../source/developer_guide/Design_Documents/patch.md:3
msgid ""
"vLLM Ascend is a platform plugin for vLLM. Due to the different release "
"cycle of vLLM and vLLM Ascend and their hardware limitations, we need to "
"patch some code in vLLM to make it compatible with vLLM Ascend."
msgstr ""
"vLLM Ascend 是 vLLM 的一个平台插件。由于 vLLM 和 vLLM Ascend 的发布周期不同且存在硬件限制,我们需要对 "
"vLLM 中的部分代码打补丁,以使其兼容 vLLM Ascend。"
#: ../../source/developer_guide/Design_Documents/patch.md:5
msgid ""
"In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to "
"adapt to changes in vLLM."
msgstr "在 vLLM Ascend 代码中,我们提供了一个补丁模块 `vllm_ascend/patch` 来适配 vLLM 的变更。"
#: ../../source/developer_guide/Design_Documents/patch.md:7
msgid "Principle"
msgstr "原则"
#: ../../source/developer_guide/Design_Documents/patch.md:9
msgid ""
"We should keep in mind that Patch is not the best way to make vLLM Ascend"
" compatible. It's just a temporary solution. The best way is to "
"contribute the change to vLLM to make it compatible with vLLM Ascend "
"initially. In vLLM Ascend, we have the basic principle for Patch "
"strategy:"
msgstr ""
"我们需要牢记,补丁并非实现 vLLM Ascend 兼容性的最佳方式,它只是一个临时解决方案。最佳方式是将修改贡献给 vLLM使其原生兼容 "
"vLLM Ascend。在 vLLM Ascend 中,我们遵循以下补丁策略基本原则:"
#: ../../source/developer_guide/Design_Documents/patch.md:11
msgid "Less is more. Please do not patch unless it's the only way currently."
msgstr "少即是多。除非是当前唯一的方法,否则请不要打补丁。"
#: ../../source/developer_guide/Design_Documents/patch.md:12
msgid ""
"Once a patch is added, it's required to describe the future plan for "
"removing the patch."
msgstr "一旦添加补丁,必须描述未来移除该补丁的计划。"
#: ../../source/developer_guide/Design_Documents/patch.md:13
msgid "Anytime, cleaning the patch code is welcome."
msgstr "随时欢迎清理补丁代码。"
#: ../../source/developer_guide/Design_Documents/patch.md:15
msgid "How it works"
msgstr "工作原理"
#: ../../source/developer_guide/Design_Documents/patch.md:17
msgid "In `vllm_ascend/patch`, you can see the code structure as follows:"
msgstr "在 `vllm_ascend/patch` 中,你可以看到如下代码结构:"
#: ../../source/developer_guide/Design_Documents/patch.md:29
msgid ""
"**platform**: The patch code in this directory is for patching the code "
"in vLLM main process. It's called by "
"`vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early "
"when vLLM is initialized."
msgstr ""
"**platform**:此目录中的补丁代码用于修补 vLLM 主进程中的代码。它在 vLLM 初始化早期由 "
"`vllm_ascend/platform::NPUPlatform::pre_register_and_update` 调用。"
#: ../../source/developer_guide/Design_Documents/patch.md:30
msgid ""
"For online mode, vLLM process calls the platform patch in "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when "
"parsing the cli args."
msgstr ""
"对于在线模式vLLM 进程在解析命令行参数时,会在 "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` 处调用平台补丁。"
#: ../../source/developer_guide/Design_Documents/patch.md:31
msgid ""
"For offline mode, vLLM process calls the platform patch in "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when "
"parsing the input parameters."
msgstr ""
"对于离线模式vLLM 进程在解析输入参数时,会在 "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` 处调用平台补丁。"
#: ../../source/developer_guide/Design_Documents/patch.md:32
msgid ""
"**worker**: The patch code in this directory is for patching the code in "
"vLLM worker process. It's called by "
"`vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker "
"process is initialized."
msgstr ""
"**worker**:此目录中的补丁代码用于修补 vLLM worker 进程中的代码。它在 vLLM worker 进程初始化时由 "
"`vllm_ascend/worker/worker::NPUWorker::__init__` 调用。"
#: ../../source/developer_guide/Design_Documents/patch.md:33
msgid ""
"For both online and offline mode, vLLM engine core process calls the "
"worker patch in "
"`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` when "
"initializing the worker process."
msgstr ""
"对于在线和离线模式vLLM 引擎核心进程在初始化 worker 进程时,会在 "
"`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` 处调用 "
"worker 补丁。"
#: ../../source/developer_guide/Design_Documents/patch.md:35
msgid "How to write a patch"
msgstr "如何编写补丁"
#: ../../source/developer_guide/Design_Documents/patch.md:37
msgid ""
"Before writing a patch, following the principle above, we should patch "
"the least code. If it's necessary, we can patch the code in either "
"**platform** or **worker** folder. Here is an example to patch "
"`distributed` module in vLLM."
msgstr ""
"在编写补丁前,遵循上述原则,我们应尽可能少地修改代码。如果确有必要,我们可以在 **platform** 或 **worker** "
"文件夹中打补丁。以下是一个修补 vLLM 中 `distributed` 模块的示例。"
#: ../../source/developer_guide/Design_Documents/patch.md:39
msgid ""
"Decide which version of vLLM we should patch. For example, after "
"analysis, here we want to patch both `0.10.0` and `main` of vLLM."
msgstr "确定我们需要修补哪个版本的 vLLM。例如经过分析这里我们想要同时修补 vLLM 的 `0.10.0` 版本和 `main` 分支。"
#: ../../source/developer_guide/Design_Documents/patch.md:40
msgid ""
"Decide which process we should patch. For example, here `distributed` "
"belongs to the vLLM main process, so we should patch `platform`."
msgstr "确定我们需要修补哪个进程。例如,这里的 `distributed` 属于 vLLM 主进程,因此我们应该修补 `platform`。"
#: ../../source/developer_guide/Design_Documents/patch.md:41
#, python-brace-format
msgid ""
"Create the patch file in the right folder. The file should be named as "
"`patch_{module_name}.py`. The example here is "
"`vllm_ascend/patch/platform/patch_distributed.py`."
msgstr ""
"在正确的文件夹中创建补丁文件。文件应命名为 `patch_{module_name}.py`。此处的示例是 "
"`vllm_ascend/patch/platform/patch_distributed.py`。"
#: ../../source/developer_guide/Design_Documents/patch.md:42
msgid "Write your patch code in the new file. Here is an example:"
msgstr "在新文件中编写你的补丁代码。以下是一个示例:"
#: ../../source/developer_guide/Design_Documents/patch.md:54
msgid ""
"Import the patch file in `__init__.py`. In this example, add `import "
"vllm_ascend.patch.platform.patch_distributed` into "
"`vllm_ascend/patch/platform/__init__.py`."
msgstr ""
"在 `__init__.py` 中导入补丁文件。在此示例中,将 `import "
"vllm_ascend.patch.platform.patch_distributed` 添加到 "
"`vllm_ascend/patch/platform/__init__.py` 中。"
#: ../../source/developer_guide/Design_Documents/patch.md:55
msgid ""
"Add the description of the patch in `vllm_ascend/patch/__init__.py`. The "
"description format is as follows:"
msgstr "在 `vllm_ascend/patch/__init__.py` 中添加补丁描述。描述格式如下:"
#: ../../source/developer_guide/Design_Documents/patch.md:71
msgid ""
"Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend "
"should contain the Unit Test and E2E Test as well. You can find more "
"details in [test guide](../contribution/testing.md)"
msgstr ""
"添加单元测试和端到端测试。vLLM Ascend 中任何新增的代码都应包含单元测试和端到端测试。更多详情请参阅 "
"[测试指南](../contribution/testing.md)。"
#: ../../source/developer_guide/Design_Documents/patch.md:73
msgid "Limitations"
msgstr "限制"
#: ../../source/developer_guide/Design_Documents/patch.md:75
msgid ""
"In V1 Engine, vLLM starts three kinds of processes: Main process, "
"EngineCore process and Worker process. Now vLLM Ascend can only patch the"
" code in Main process and Worker process by default. If you want to patch"
" the code running in EngineCore process, you should patch EngineCore "
"process entirely during setup. Find the entire code in "
"`vllm.v1.engine.core`. Please override `EngineCoreProc` and "
"`DPEngineCoreProc` entirely."
msgstr ""
"在 V1 引擎中vLLM 启动三种进程主进程、EngineCore 进程和 Worker 进程。目前 vLLM Ascend "
"默认只能修补主进程和 Worker 进程中的代码。如果你想修补 EngineCore 进程中运行的代码,你需要在设置阶段完全修补 "
"EngineCore 进程。相关完整代码位于 `vllm.v1.engine.core`。请完全重写 `EngineCoreProc` 和 "
"`DPEngineCoreProc`。"
#: ../../source/developer_guide/Design_Documents/patch.md:76
msgid ""
"If you are running edited vLLM code, the version of vLLM may be changed "
"automatically. For example, if you run the edited vLLM based on v0.9.n, "
"the version of vLLM may be changed to v0.9.nxxx. In this case, the patch "
"for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend"
" can't distinguish the version of the vLLM you're using. In this case, "
"you can set the environment variable `VLLM_VERSION` to specify the "
"version of the vLLM you're using, and then the patch for that version "
"(e.g., v0.9.n) should work."
msgstr ""
"如果你运行的是经过编辑的 vLLM 代码vLLM 的版本可能会自动更改。例如,如果你基于 v0.9.n 运行编辑后的 vLLMvLLM "
"的版本可能会变为 v0.9.nxxx。在这种情况下vLLM Ascend 中针对 v0.9.n 的补丁将无法按预期工作,因为 vLLM "
"Ascend 无法区分你正在使用的 vLLM 版本。此时,你可以设置环境变量 `VLLM_VERSION` 来指定你使用的 vLLM "
"版本,这样针对该版本(例如 v0.9.n的补丁就应该能正常工作了。"

View File

@@ -1,381 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/Design_Documents/quantization.md:1
msgid "Quantization Adaptation Guide"
msgstr "量化适配指南"
#: ../../source/developer_guide/Design_Documents/quantization.md:3
msgid ""
"This document provides guidance for adapting quantization algorithms and "
"models related to **ModelSlim**."
msgstr "本文档为适配与 **ModelSlim** 相关的量化算法和模型提供指导。"
#: ../../source/developer_guide/Design_Documents/quantization.md:5
msgid "Quantization Feature Introduction"
msgstr "量化特性介绍"
#: ../../source/developer_guide/Design_Documents/quantization.md:7
msgid "Quantization Inference Process"
msgstr "量化推理流程"
#: ../../source/developer_guide/Design_Documents/quantization.md:9
msgid ""
"The current process for registering and obtaining quantization methods in"
" vLLM Ascend is as follows:"
msgstr "当前 vLLM Ascend 中注册和获取量化方法的流程如下:"
#: ../../source/developer_guide/Design_Documents/quantization.md:11
msgid "![get_quant_method](../../assets/quantization/get_quant_method.png)"
msgstr "![get_quant_method](../../assets/quantization/get_quant_method.png)"
#: ../../source/developer_guide/Design_Documents/quantization.md:11
msgid "get_quant_method"
msgstr "get_quant_method"
#: ../../source/developer_guide/Design_Documents/quantization.md:13
msgid ""
"vLLM Ascend registers a custom Ascend quantization method. By configuring"
" the `--quantization ascend` parameter (or `quantization=\"ascend\"` for "
"offline), the quantization feature is enabled. When constructing the "
"`quant_config`, the registered `AscendModelSlimConfig` is initialized and"
" `get_quant_method` is called to obtain the quantization method "
"corresponding to each weight part, stored in the `quant_method` "
"attribute."
msgstr ""
"vLLM Ascend 注册了一个自定义的 Ascend 量化方法。通过配置 `--quantization ascend` 参数(或离线时使用 "
"`quantization=\"ascend\"`),即可启用量化功能。在构建 `quant_config` 时,会初始化已注册的 "
"`AscendModelSlimConfig`,并调用 `get_quant_method` 来获取每个权重部分对应的量化方法,存储在 "
"`quant_method` 属性中。"
#: ../../source/developer_guide/Design_Documents/quantization.md:15
msgid ""
"Currently supported quantization methods include `AscendLinearMethod`, "
"`AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding "
"non-quantized methods:"
msgstr ""
"当前支持的量化方法包括 "
"`AscendLinearMethod`、`AscendFusedMoEMethod`、`AscendEmbeddingMethod` "
"及其对应的非量化方法:"
#: ../../source/developer_guide/Design_Documents/quantization.md:17
msgid "![quant_methods_overview](../../assets/quantization/quant_methods_overview.png)"
msgstr "![quant_methods_overview](../../assets/quantization/quant_methods_overview.png)"
#: ../../source/developer_guide/Design_Documents/quantization.md:17
msgid "quant_methods_overview"
msgstr "quant_methods_overview"
#: ../../source/developer_guide/Design_Documents/quantization.md:19
msgid ""
"The quantization method base class defined by vLLM and the overall call "
"flow of quantization methods are as follows:"
msgstr "vLLM 定义的量化方法基类及量化方法的整体调用流程如下:"
#: ../../source/developer_guide/Design_Documents/quantization.md:21
msgid "![quant_method_call_flow](../../assets/quantization/quant_method_call_flow.png)"
msgstr "![quant_method_call_flow](../../assets/quantization/quant_method_call_flow.png)"
#: ../../source/developer_guide/Design_Documents/quantization.md:21
msgid "quant_method_call_flow"
msgstr "quant_method_call_flow"
#: ../../source/developer_guide/Design_Documents/quantization.md:23
msgid ""
"The `embedding` method is generally not implemented for quantization, "
"focusing only on the other three methods."
msgstr "`embedding` 方法通常不实现量化,仅关注其他三种方法。"
#: ../../source/developer_guide/Design_Documents/quantization.md:25
msgid ""
"The `create_weights` method is used for weight initialization; the "
"`process_weights_after_loading` method is used for weight post-"
"processing, such as transposition, format conversion, data type "
"conversion, etc.; the `apply` method is used to perform activation "
"quantization and quantized matrix multiplication calculations during the "
"forward process."
msgstr ""
"`create_weights` 方法用于权重初始化;`process_weights_after_loading` "
"方法用于权重后处理,例如转置、格式转换、数据类型转换等;`apply` 方法用于在前向传播过程中执行激活量化和量化矩阵乘法计算。"
#: ../../source/developer_guide/Design_Documents/quantization.md:27
msgid ""
"We need to implement the `create_weights`, "
"`process_weights_after_loading`, and `apply` methods for different "
"**layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**)."
msgstr ""
"我们需要为不同的**层****attention**、**mlp**、**MoE (Mixture of Experts)**)实现 "
"`create_weights`、`process_weights_after_loading` 和 `apply` 方法。"
#: ../../source/developer_guide/Design_Documents/quantization.md:29
msgid ""
"**Supplement**: When loading the model, the quantized model's description"
" file **quant_model_description.json** needs to be read. This file "
"describes the quantization configuration and parameters for each part of "
"the model weights, for example:"
msgstr ""
"**补充说明**:加载模型时,需要读取量化模型的描述文件 "
"**quant_model_description.json**。该文件描述了模型各部分权重的量化配置和参数,例如:"
#: ../../source/developer_guide/Design_Documents/quantization.md:49
msgid ""
"Based on the above content, we present a brief description of the "
"adaptation process for quantization algorithms and quantized models."
msgstr "基于以上内容,我们对量化算法和量化模型的适配过程进行简要描述。"
#: ../../source/developer_guide/Design_Documents/quantization.md:51
msgid "Quantization Algorithm Adaptation"
msgstr "量化算法适配"
#: ../../source/developer_guide/Design_Documents/quantization.md:53
msgid ""
"**Step 1: Algorithm Design**. Define the algorithm ID (e.g., "
"`W4A8_DYNAMIC`), determine supported layers (linear, moe, attention), and"
" design the quantization scheme (static/dynamic, "
"pertensor/perchannel/pergroup)."
msgstr ""
"**步骤 1算法设计**。定义算法 ID例如 "
"`W4A8_DYNAMIC`确定支持的层linear、moe、attention并设计量化方案静态/动态、pertensor/perchannel/pergroup。"
#: ../../source/developer_guide/Design_Documents/quantization.md:54
msgid ""
"**Step 2: Registration**. Use the `@register_scheme` decorator in "
"`vllm_ascend/quantization/methods/registry.py` to register your "
"quantization scheme class."
msgstr ""
"**步骤 2注册**。在 `vllm_ascend/quantization/methods/registry.py` 中使用 "
"`@register_scheme` 装饰器注册您的量化方案类。"
#: ../../source/developer_guide/Design_Documents/quantization.md:68
msgid ""
"**Step 3: Implementation**. Create an algorithm implementation file, such"
" as `vllm_ascend/quantization/methods/w4a8.py`, and implement the method "
"class and logic."
msgstr ""
"**步骤 3实现**。创建一个算法实现文件,例如 "
"`vllm_ascend/quantization/methods/w4a8.py`,并实现方法类和逻辑。"
#: ../../source/developer_guide/Design_Documents/quantization.md:69
msgid ""
"**Step 4: Testing**. Use your algorithm to generate quantization "
"configurations and verify correctness and performance on target models "
"and hardware."
msgstr "**步骤 4测试**。使用您的算法生成量化配置,并在目标模型和硬件上验证正确性和性能。"
#: ../../source/developer_guide/Design_Documents/quantization.md:71
msgid "Quantized Model Adaptation"
msgstr "量化模型适配"
#: ../../source/developer_guide/Design_Documents/quantization.md:73
msgid ""
"Adapting a new quantized model requires ensuring the following three "
"points:"
msgstr "适配一个新的量化模型需要确保以下三点:"
#: ../../source/developer_guide/Design_Documents/quantization.md:75
msgid "The original model has been successfully adapted in `vLLM Ascend`."
msgstr "原始模型已在 `vLLM Ascend` 中成功适配。"
#: ../../source/developer_guide/Design_Documents/quantization.md:76
msgid ""
"**Fused Module Mapping**: Add the model's `model_type` to "
"`packed_modules_model_mapping` in "
"`vllm_ascend/quantization/modelslim_config.py` (e.g., `qkv_proj`, "
"`gate_up_proj`, `experts`) to ensure sharding consistency and correct "
"loading."
msgstr ""
"**融合模块映射**:将模型的 `model_type` 添加到 "
"`vllm_ascend/quantization/modelslim_config.py` 中的 "
"`packed_modules_model_mapping`(例如 "
"`qkv_proj`、`gate_up_proj`、`experts`),以确保分片一致性和正确加载。"
#: ../../source/developer_guide/Design_Documents/quantization.md:96
msgid ""
"All quantization algorithms used by the quantized model have been "
"integrated into the `quantization` module."
msgstr "量化模型使用的所有量化算法都已集成到 `quantization` 模块中。"
#: ../../source/developer_guide/Design_Documents/quantization.md:98
msgid "Currently Supported Quantization Algorithms"
msgstr "当前支持的量化算法"
#: ../../source/developer_guide/Design_Documents/quantization.md:100
msgid ""
"vLLM Ascend supports multiple quantization algorithms. The following "
"table provides an overview of each quantization algorithm based on the "
"implementation in the `vllm_ascend.quantization` module:"
msgstr "vLLM Ascend 支持多种量化算法。下表基于 `vllm_ascend.quantization` 模块中的实现,概述了每种量化算法:"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Algorithm"
msgstr "算法"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Weight"
msgstr "权重"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Activation"
msgstr "激活"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Weight Granularity"
msgstr "权重粒度"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Activation Granularity"
msgstr "激活粒度"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Type"
msgstr "类型"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Description"
msgstr "描述"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W4A16`"
msgstr "`W4A16`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "INT4"
msgstr "INT4"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "FP16/BF16"
msgstr "FP16/BF16"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Per-Group"
msgstr "Per-Group"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Per-Tensor"
msgstr "Per-Tensor"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Static"
msgstr "静态"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"4-bit weight quantization with 16-bit activation precision, specifically "
"designed for MoE model expert layers, supporting int32 format weight "
"packing"
msgstr "4位权重量化16位激活精度专为 MoE 模型专家层设计,支持 int32 格式权重打包"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W8A16`"
msgstr "`W8A16`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "INT8"
msgstr "INT8"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Per-Channel"
msgstr "Per-Channel"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"8-bit weight quantization with 16-bit activation precision, balancing "
"accuracy and performance, suitable for linear layers"
msgstr "8位权重量化16位激活精度平衡精度与性能适用于线性层"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W8A8`"
msgstr "`W8A8`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"Static activation quantization, suitable for scenarios requiring high "
"precision"
msgstr "静态激活量化,适用于需要高精度的场景"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W8A8_DYNAMIC`"
msgstr "`W8A8_DYNAMIC`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Per-Token"
msgstr "Per-Token"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Dynamic"
msgstr "动态"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Dynamic activation quantization with per-token scaling factor calculation"
msgstr "动态激活量化,按 token 计算缩放因子"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W4A8_DYNAMIC`"
msgstr "`W4A8_DYNAMIC`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"Supports both direct per-channel quantization to 4-bit and two-step "
"quantization (per-channel to 8-bit then per-group to 4-bit)"
msgstr "支持直接按通道量化到4位以及两步量化先按通道量化到8位再按组量化到4位"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W4A4_FLATQUANT_DYNAMIC`"
msgstr "`W4A4_FLATQUANT_DYNAMIC`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"Uses FlatQuant for activation distribution smoothing before 4-bit dynamic"
" quantization, with additional matrix multiplications for precision "
"preservation"
msgstr "在4位动态量化前使用 FlatQuant 平滑激活分布,并通过额外的矩阵乘法来保持精度"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "`W8A8_MIX`"
msgstr "`W8A8_MIX`"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Per-Tensor/Token"
msgstr "Per-Tensor/Token"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid "Mixed"
msgstr "混合"
#: ../../source/developer_guide/Design_Documents/quantization.md
msgid ""
"We support two deployment modes: PD Colocation (dynamic quantization for "
"both P and D) and PD Disaggregation (dynamic-quant P and static-quant D)"
msgstr "我们支持两种部署模式PD 共部署P和D均使用动态量化和 PD 分离部署P使用动态量化D使用静态量化"
#: ../../source/developer_guide/Design_Documents/quantization.md:112
msgid ""
"**Static vs Dynamic:** Static quantization uses pre-computed scaling "
"factors with better performance, while dynamic quantization computes "
"scaling factors on-the-fly for each token/activation tensor with higher "
"precision."
msgstr "**静态与动态:** 静态量化使用预计算的缩放因子,性能更优;而动态量化则为每个 token/激活张量实时计算缩放因子,精度更高。"
#: ../../source/developer_guide/Design_Documents/quantization.md:114
msgid ""
"**Granularity:** Refers to the scope of scaling factor computation (e.g.,"
" per-tensor, per-channel, per-group)."
msgstr "**粒度:** 指缩放因子计算的范围例如per-tensor、per-channel、per-group。"

View File

@@ -4,182 +4,184 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/contribution/index.md:107
#: ../../developer_guide/contribution/index.md:107
msgid "Index"
msgstr "索引"
#: ../../source/developer_guide/contribution/index.md:1
#: ../../developer_guide/contribution/index.md:1
msgid "Contributing"
msgstr "贡献指南"
msgstr "贡献"
#: ../../source/developer_guide/contribution/index.md:3
msgid "Building and Testing"
#: ../../developer_guide/contribution/index.md:3
msgid "Building and testing"
msgstr "构建与测试"
#: ../../source/developer_guide/contribution/index.md:5
#: ../../developer_guide/contribution/index.md:4
msgid ""
"It's recommended to set up a local development environment to build vllm-"
"ascend and run tests before you submit a PR."
msgstr "建议在提交 PR 之前,先搭建本地开发环境来构建 vllm-ascend 并运行测试。"
"It's recommended to set up a local development environment to build and test"
" before you submit a PR."
msgstr "建议先搭建本地开发环境来进行构建和测试,再提交 PR。"
#: ../../source/developer_guide/contribution/index.md:8
msgid "Set up a development environment"
msgstr "设置开发环境"
#: ../../developer_guide/contribution/index.md:7
msgid "Setup development environment"
msgstr "搭建开发环境"
#: ../../source/developer_guide/contribution/index.md:10
#: ../../developer_guide/contribution/index.md:9
msgid ""
"Theoretically, the vllm-ascend build is only supported on Linux because "
"`vllm-ascend` dependency `torch_npu` only supports Linux."
msgstr "理论上vllm-ascend 的构建仅支持 Linux因为其依赖项 `torch_npu` 仅支持 Linux。"
msgstr ""
"理论上vllm-ascend 构建仅支持 Linux因为 `vllm-ascend` 的依赖项 `torch_npu` 只支持 Linux。"
#: ../../source/developer_guide/contribution/index.md:13
#: ../../developer_guide/contribution/index.md:12
msgid ""
"But you can still set up a development environment on Linux/Windows/macOS"
" for linting and running basic tests."
msgstr "但你仍然可以在 Linux/Windows/macOS 上设置开发环境,用于代码规检查和运行基本测试"
"But you can still set up dev env on Linux/Windows/macOS for linting and "
"basic test as following commands:"
msgstr "但你仍然可以在 Linux/Windows/macOS 上按照以下命令设置开发环境,用于代码规检查和基本测试"
#: ../../source/developer_guide/contribution/index.md:16
#: ../../developer_guide/contribution/index.md:15
msgid "Run lint locally"
msgstr "本地运行代码检查"
msgstr "本地运行 lint"
#: ../../source/developer_guide/contribution/index.md:35
#: ../../developer_guide/contribution/index.md:33
msgid "Run CI locally"
msgstr "本地运行 CI"
msgstr "本地运行CI"
#: ../../source/developer_guide/contribution/index.md:37
msgid ""
"After completing \"Run lint\" setup, you can run CI (Continuous "
"integration) locally:"
msgstr "完成“运行代码检查”设置后,你可以在本地运行 CI持续集成"
#: ../../developer_guide/contribution/index.md:35
msgid "After complete \"Run lint\" setup, you can run CI locally:"
msgstr "在完成“运行 lint”设置后你可以在本地运行 CI"
#: ../../source/developer_guide/contribution/index.md:62
#: ../../developer_guide/contribution/index.md:61
msgid "Submit the commit"
msgstr "提交更改"
msgstr "提交该提交"
#: ../../source/developer_guide/contribution/index.md:69
msgid "🎉 Congratulations! You have completed the development environment setup."
msgstr "🎉 恭喜!您已完成开发环境的设置。"
#: ../../developer_guide/contribution/index.md:68
msgid ""
"🎉 Congratulations! You have completed the development environment setup."
msgstr "🎉 恭喜!你已经完成了开发环境的搭建。"
#: ../../source/developer_guide/contribution/index.md:71
msgid "Testing locally"
#: ../../developer_guide/contribution/index.md:70
msgid "Test locally"
msgstr "本地测试"
#: ../../source/developer_guide/contribution/index.md:73
#: ../../developer_guide/contribution/index.md:72
msgid ""
"You can refer to [Testing](./testing.md) to set up a testing environment"
" and running tests locally."
msgstr "你可以参考 [测试](./testing.md) 文档来设置测试环境并在本地运行测试。"
"You can refer to [Testing](./testing.md) doc to help you setup testing "
"environment and running tests locally."
msgstr "你可以参考 [测试](./testing.md) 文档,帮助你搭建测试环境并在本地运行测试。"
#: ../../source/developer_guide/contribution/index.md:75
#: ../../developer_guide/contribution/index.md:74
msgid "DCO and Signed-off-by"
msgstr "DCO 与签署确认"
msgstr "DCO 和签名确认"
#: ../../source/developer_guide/contribution/index.md:77
#: ../../developer_guide/contribution/index.md:76
msgid ""
"When contributing changes to this project, you must agree to the DCO. "
"Commits must include a `Signed-off-by:` header which certifies agreement "
"with the terms of the DCO (Developer Certificate of Origin)."
msgstr "本项目贡献更改时,您必须同意 DCO。提交必须包含 `Signed-off-by:` 头,以证明您同意 DCO(开发者原创证书)的条款。"
"with the terms of the DCO."
msgstr "当为本项目贡献更改时,您必须同意 DCO。提交必须包含 `Signed-off-by:` 头,以证明您同意 DCO 的条款。"
#: ../../source/developer_guide/contribution/index.md:79
#: ../../developer_guide/contribution/index.md:78
msgid "Using `-s` with `git commit` will automatically add this header."
msgstr "在 `git commit` 命令中使用 `-s` 参数会自动添加此标头。"
msgstr "在使用 `git commit` 时加上 `-s` 参数会自动添加这个头部信息。"
#: ../../source/developer_guide/contribution/index.md:81
#: ../../developer_guide/contribution/index.md:80
msgid "PR Title and Classification"
msgstr "PR 标题与分类"
#: ../../source/developer_guide/contribution/index.md:83
#: ../../developer_guide/contribution/index.md:82
msgid ""
"Only specific types of PRs will be reviewed. The PR title is prefixed "
"appropriately to indicate the type of change. Please use one of the "
"following:"
msgstr "只有特定类型的 PR 会被审核。PR 标题应使用适的前缀指明更改类型。请使用以下前缀之一:"
msgstr "只有特定类型的 PR 会被审核。PR 标题应使用适的前缀指明更改类型。请使用以下之一:"
#: ../../source/developer_guide/contribution/index.md:85
#: ../../developer_guide/contribution/index.md:84
msgid "`[Attention]` for new features or optimization in attention."
msgstr "`[Attention]` 用于注意力机制的新功能或优化。"
msgstr "`[Attention]` 用于注意力机制中新特性或优化。"
#: ../../source/developer_guide/contribution/index.md:86
#: ../../developer_guide/contribution/index.md:85
msgid "`[Communicator]` for new features or optimization in communicators."
msgstr "`[Communicator]` 用于通信器的新功能或优化。"
msgstr "`[Communicator]` 用于通信器的新特性或优化。"
#: ../../source/developer_guide/contribution/index.md:87
#: ../../developer_guide/contribution/index.md:86
msgid "`[ModelRunner]` for new features or optimization in model runner."
msgstr "`[ModelRunner]` 用于模型运行器的新功能或优化。"
msgstr "`[ModelRunner]` 用于模型运行器的新功能或优化。"
#: ../../source/developer_guide/contribution/index.md:88
#: ../../developer_guide/contribution/index.md:87
msgid "`[Platform]` for new features or optimization in platform."
msgstr "`[Platform]` 用于平台新功能或优化。"
msgstr "`[Platform]` 用于平台新功能或优化。"
#: ../../source/developer_guide/contribution/index.md:89
#: ../../developer_guide/contribution/index.md:88
msgid "`[Worker]` for new features or optimization in worker."
msgstr "`[Worker]` 用于工作器的新功能或优化。"
msgstr "`[Worker]` 用于 worker 的新功能或优化。"
#: ../../source/developer_guide/contribution/index.md:90
#: ../../developer_guide/contribution/index.md:89
msgid ""
"`[Core]` for new features or optimization in the core vllm-ascend logic "
"(such as platform, attention, communicators, model runner)"
msgstr "`[Core]` 用于核心 vllm-ascend 逻辑中的新功能或优化(例如平台、注意力机制、通信器、模型运行器)。"
msgstr "`[Core]` 用于核心 vllm-ascend 逻辑中的新特性或优化(例如平台、注意力机制、通信器、模型运行器)。"
#: ../../source/developer_guide/contribution/index.md:91
msgid "`[Kernel]` for changes affecting compute kernels and ops."
msgstr "`[Kernel]` 用于影响计算内核和操作的更改。"
#: ../../developer_guide/contribution/index.md:90
msgid "`[Kernel]` changes affecting compute kernels and ops."
msgstr "`[Kernel]` 影响计算内核和操作的更改。"
#: ../../source/developer_guide/contribution/index.md:92
#: ../../developer_guide/contribution/index.md:91
msgid "`[Bugfix]` for bug fixes."
msgstr "`[Bugfix]` 用于错误修复。"
msgstr "`[Bugfix]` 用于表示错误修复。"
#: ../../source/developer_guide/contribution/index.md:93
#: ../../developer_guide/contribution/index.md:92
msgid "`[Doc]` for documentation fixes and improvements."
msgstr "`[Doc]` 用于文档修复和改进。"
#: ../../source/developer_guide/contribution/index.md:94
#: ../../developer_guide/contribution/index.md:93
msgid "`[Test]` for tests (such as unit tests)."
msgstr "`[Test]` 用于测试(如单元测试)。"
msgstr "`[Test]` 用于测试(如单元测试)。"
#: ../../source/developer_guide/contribution/index.md:95
#: ../../developer_guide/contribution/index.md:94
msgid "`[CI]` for build or continuous integration improvements."
msgstr "`[CI]` 用于构建或持续集成的改进。"
#: ../../source/developer_guide/contribution/index.md:96
#: ../../developer_guide/contribution/index.md:95
msgid ""
"`[Misc]` for PRs that do not fit the above categories. Please use this "
"sparingly."
msgstr "`[Misc]` 用于不属于上述类别的 PR。请谨慎使用此标签。"
msgstr "于不属于上述类别的 PR,请使用 `[Misc]`。请谨慎使用此标签。"
#: ../../source/developer_guide/contribution/index.md:99
#: ../../developer_guide/contribution/index.md:98
msgid ""
"If the PR spans more than one category, please include all relevant "
"prefixes."
msgstr "如果 PR 涉及多个类别,请包含所有相关的前缀。"
msgstr "如果拉取请求PR涵盖多个类别,请包含所有相关的前缀。"
#: ../../source/developer_guide/contribution/index.md:102
#: ../../developer_guide/contribution/index.md:101
msgid "Others"
msgstr "其他"
#: ../../source/developer_guide/contribution/index.md:104
#: ../../developer_guide/contribution/index.md:103
msgid ""
"You may find more information about contributing to vLLM Ascend backend "
"plugin on "
"[<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing). If "
"you encounter any problems while contributing, feel free to submit a PR "
"to improve the documentation to help other developers."
"[<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html)."
" If you find any problem when contributing, you can feel free to submit a PR"
" to improve the doc to help other developers."
msgstr ""
"你可以在 [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing) "
"上找到有关为 vLLM Ascend 后端插件做贡献的更多信息。如果在贡献过程中遇到任何问题,欢迎随时提交 PR 来改进文档,以帮助其他开发者。"
"你可以在 "
"[<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html)"
" 上找到有关为 vLLM Ascend 后端插件做贡献的更多信息。如果你在贡献过程中遇到任何问题,欢迎随时提交 PR 来改进文档,以帮助其他开发者。"

View File

@@ -1,248 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/contribution/multi_node_test.md:1
msgid "Multi Node Test"
msgstr "多节点测试"
#: ../../source/developer_guide/contribution/multi_node_test.md:3
msgid ""
"Multi-Node CI is designed to test distributed scenarios of very large "
"models, eg: disaggregated_prefill multi DP across multi nodes and so on."
msgstr ""
"多节点CI旨在测试超大规模模型的分布式场景例如跨多节点的解耦预填充disaggregated_prefill、多数据并行multi "
"DP等。"
#: ../../source/developer_guide/contribution/multi_node_test.md:5
msgid "How it works"
msgstr "工作原理"
#: ../../source/developer_guide/contribution/multi_node_test.md:7
msgid ""
"The following picture shows the basic deployment view of the multi-node "
"CI mechanism. It shows how the GitHub action interacts with "
"[lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd "
"resource)."
msgstr ""
"下图展示了多节点CI机制的基本部署视图。它说明了GitHub "
"Action如何与[lws](https://lws.sigs.k8s.io/docs/overview/)一种Kubernetes "
"CRD资源进行交互。"
#: ../../source/developer_guide/contribution/multi_node_test.md:9
msgid "![alt text](../../assets/deployment.png)"
msgstr "![替代文本](../../assets/deployment.png)"
#: ../../source/developer_guide/contribution/multi_node_test.md:9
#: ../../source/developer_guide/contribution/multi_node_test.md:13
msgid "alt text"
msgstr "替代文本"
#: ../../source/developer_guide/contribution/multi_node_test.md:11
msgid ""
"From the workflow perspective, we can see how the final test script is "
"executed, The key point is that these two [lws.yaml and "
"run.sh](https://github.com/vllm-project/vllm-"
"ascend/tree/main/tests/e2e/nightly/multi_node/scripts), The former "
"defines how our k8s cluster is pulled up, and the latter defines the "
"entry script when the pod is started, Each node executes different logic "
"according to the "
"[LWS_WORKER_INDEX](https://lws.sigs.k8s.io/docs/reference/labels-"
"annotations-and-environment-variables/) environment variable, so that "
"multiple nodes can form a distributed cluster to perform tasks."
msgstr ""
"从工作流的角度,我们可以看到最终的测试脚本是如何执行的。关键在于这两个文件:[lws.yaml和run.sh](https://github.com"
"/vllm-project/vllm-"
"ascend/tree/main/tests/e2e/nightly/multi_node/scripts)。前者定义了我们的k8s集群如何被拉起后者定义了Pod启动时的入口脚本。每个节点根据[LWS_WORKER_INDEX](https://lws.sigs.k8s.io/docs/reference"
"/labels-annotations-and-environment-"
"variables/)环境变量执行不同的逻辑,从而使多个节点能够组成一个分布式集群来执行任务。"
#: ../../source/developer_guide/contribution/multi_node_test.md:13
msgid "![alt text](../../assets/workflow.png)"
msgstr "![替代文本](../../assets/workflow.png)"
#: ../../source/developer_guide/contribution/multi_node_test.md:15
msgid "How to contribute"
msgstr "如何贡献"
#: ../../source/developer_guide/contribution/multi_node_test.md:17
msgid "Upload custom weights"
msgstr "上传自定义权重"
#: ../../source/developer_guide/contribution/multi_node_test.md:19
msgid ""
"If you need customized weights, for example, you quantized a w8a8 weight "
"for DeepSeek-V3 and you want your weight to run on CI, uploading weights "
"to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization"
"/vllm-ascend) organization is welcome. If you do not have permission to "
"upload, please contact @Potabk"
msgstr ""
"如果您需要自定义权重例如您为DeepSeek-V3量化了一个w8a8权重并希望您的权重能在CI上运行欢迎将权重上传至ModelScope的"
"[vllm-ascend](https://www.modelscope.cn/organization/vllm-"
"ascend)组织。如果您没有上传权限,请联系@Potabk。"
#: ../../source/developer_guide/contribution/multi_node_test.md:21
msgid "Add config yaml"
msgstr "添加配置YAML"
#: ../../source/developer_guide/contribution/multi_node_test.md:23
msgid ""
"As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-"
"ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106)"
" shows, a k8s pod startup means traversing all *.yaml files in the "
"[directory](https://github.com/vllm-project/vllm-"
"ascend/tree/main/tests/e2e/nightly/multi_node/config/), reading and "
"executing according to different configurations, so what we need to do is"
" just add \"yamls\" like [DeepSeek-V3.yaml](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml)."
msgstr ""
"如入口脚本[run.sh](https://github.com/vllm-project/vllm-"
"ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106)所示一个k8s"
" Pod的启动意味着遍历[目录](https://github.com/vllm-project/vllm-"
"ascend/tree/main/tests/e2e/nightly/multi_node/config/)中的所有*.yaml文件并根据不同的配置读取和执行。因此我们需要做的就是添加类似[DeepSeek-V3.yaml](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml)的\"yaml\"文件。"
#: ../../source/developer_guide/contribution/multi_node_test.md:25
msgid ""
"Suppose you have **2 nodes** running a 1P1D setup (1 Prefillers + 1 "
"Decoder):"
msgstr "假设您有**2个节点**运行1P1D设置1个预填充器 + 1个解码器"
#: ../../source/developer_guide/contribution/multi_node_test.md:27
msgid "you may add a config file looks like:"
msgstr "您可以添加一个类似这样的配置文件:"
#: ../../source/developer_guide/contribution/multi_node_test.md:73
msgid "Add the case to nightly workflow"
msgstr "将用例添加到夜间工作流"
#: ../../source/developer_guide/contribution/multi_node_test.md:75
msgid ""
"Currently, the multi-node test workflow is defined in the "
"[nightly_test_a3.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)"
msgstr ""
"目前,多节点测试工作流定义在[nightly_test_a3.yaml](https://github.com/vllm-project"
"/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)中。"
#: ../../source/developer_guide/contribution/multi_node_test.md:110
msgid ""
"The matrix above defines all the parameters required to add a multi-"
"machine use case. The parameters worth noting (if you are adding a new "
"use case) are `size` and the path to the yaml configuration file. The "
"former defines the number of nodes required for your use case, and the "
"latter defines the path to the configuration file you have completed in "
"step 2."
msgstr "上面的矩阵定义了添加一个多机用例所需的所有参数。值得注意的参数(如果您正在添加一个新用例)是`size`和yaml配置文件的路径。前者定义了您的用例所需的节点数量后者定义了您在步骤2中完成的配置文件的路径。"
#: ../../source/developer_guide/contribution/multi_node_test.md:112
msgid "Run Multi-Node tests locally"
msgstr "本地运行多节点测试"
#: ../../source/developer_guide/contribution/multi_node_test.md:114
msgid "1. Use kubernetes"
msgstr "1. 使用Kubernetes"
#: ../../source/developer_guide/contribution/multi_node_test.md:116
msgid ""
"This section assumes that you already have a "
"[Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment "
"locally. Then you can easily start our test with one click."
msgstr ""
"本节假设您本地已经有一个[Kubernetes](https://kubernetes.io/docs/setup/) "
"NPU集群环境。然后您可以轻松地一键启动我们的测试。"
#: ../../source/developer_guide/contribution/multi_node_test.md:118
msgid "Step 1. Install LWS CRD resources"
msgstr "步骤 1. 安装LWS CRD资源"
#: ../../source/developer_guide/contribution/multi_node_test.md:120
msgid ""
"See <https://lws.sigs.k8s.io/docs/installation/> Which can be used as a "
"reference"
msgstr "参考<https://lws.sigs.k8s.io/docs/installation/>"
#: ../../source/developer_guide/contribution/multi_node_test.md:122
msgid "Step 2. Deploy the following yaml file `lws.yaml` as needed"
msgstr "步骤 2. 按需部署以下yaml文件`lws.yaml`"
#: ../../source/developer_guide/contribution/multi_node_test.md:258
msgid "Verify the status of the pods:"
msgstr "验证Pod的状态"
#: ../../source/developer_guide/contribution/multi_node_test.md:264
msgid "Should get an output similar to this:"
msgstr "应该会得到类似这样的输出:"
#: ../../source/developer_guide/contribution/multi_node_test.md:272
msgid "Verify that the distributed inference works:"
msgstr "验证分布式推理是否正常工作:"
#: ../../source/developer_guide/contribution/multi_node_test.md:278
msgid "Should get something similar to this:"
msgstr "应该会得到类似这样的结果:"
#: ../../source/developer_guide/contribution/multi_node_test.md:312
msgid "2. Test without kubernetes"
msgstr "2. 不使用Kubernetes进行测试"
#: ../../source/developer_guide/contribution/multi_node_test.md:314
msgid ""
"Since our script is Kubernetes-friendly, we need to actively pass in some"
" cluster information if you don't have a Kubernetes environment."
msgstr "由于我们的脚本对Kubernetes友好如果您没有Kubernetes环境则需要主动传入一些集群信息。"
#: ../../source/developer_guide/contribution/multi_node_test.md:316
msgid "Step 1. Add cluster_hosts to config yamls"
msgstr "步骤 1. 向配置YAML文件添加cluster_hosts"
#: ../../source/developer_guide/contribution/multi_node_test.md:318
msgid ""
"Modify on every cluster host, commands just like "
"[DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/e760aae1df7814073a4180172385505c1ec0fd83/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml#L25)"
" after the configure item `num_nodes` , for example: `cluster_hosts: "
"[\"xxx.xxx.xxx.188\", \"xxx.xxx.xxx.212\"]`"
msgstr ""
"在每个集群主机上进行修改,就像[DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/e760aae1df7814073a4180172385505c1ec0fd83/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml#L25)那样,在配置项`num_nodes`之后添加,例如:`cluster_hosts:"
" [\"xxx.xxx.xxx.188\", \"xxx.xxx.xxx.212\"]`"
#: ../../source/developer_guide/contribution/multi_node_test.md:321
msgid "Step 2. Install develop environment"
msgstr "步骤 2. 安装开发环境"
#: ../../source/developer_guide/contribution/multi_node_test.md:322
msgid "Install vllm-ascend develop packages on every cluster host"
msgstr "在每个集群主机上安装vllm-ascend开发包"
#: ../../source/developer_guide/contribution/multi_node_test.md:329
msgid "Install AISBench on the first host(leader node) in cluster_hosts"
msgstr "在cluster_hosts中的第一个主机主节点上安装AISBench"
#: ../../source/developer_guide/contribution/multi_node_test.md:341
msgid "Step 3. Running test locally"
msgstr "步骤 3. 本地运行测试"
#: ../../source/developer_guide/contribution/multi_node_test.md:343
msgid "Run the script on **each node separately**"
msgstr "在**每个节点上分别**运行脚本"

View File

@@ -4,289 +4,234 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/contribution/testing.md:1
#: ../../developer_guide/contribution/testing.md:1
msgid "Testing"
msgstr "测试"
#: ../../source/developer_guide/contribution/testing.md:3
#: ../../developer_guide/contribution/testing.md:3
msgid ""
"This document explains how to write E2E tests and unit tests to verify "
"the implementation of your feature."
msgstr "本文档介绍如何编写端到端测试和单元测试,以验证您实现的功能。"
"This secition explains how to write e2e tests and unit tests to verify the "
"implementation of your feature."
msgstr "本介绍如何编写端到端测试和单元测试,以验证的功能实现。"
#: ../../source/developer_guide/contribution/testing.md:5
msgid "Set up a test environment"
#: ../../developer_guide/contribution/testing.md:5
msgid "Setup test environment"
msgstr "设置测试环境"
#: ../../source/developer_guide/contribution/testing.md:7
#: ../../developer_guide/contribution/testing.md:7
msgid ""
"The fastest way to set up a test environment is to use the main branch's "
"The fastest way to setup test environment is to use the main branch "
"container image:"
msgstr "设置测试环境最快的方法是使用 main 分支的容器镜像:"
msgstr "搭建测试环境最快的方法是使用 main 分支的容器镜像:"
#: ../../source/developer_guide/contribution/testing.md
#: ../../developer_guide/contribution/testing.md
msgid "Local (CPU)"
msgstr "本地CPU"
#: ../../source/developer_guide/contribution/testing.md:18
msgid "You can run the unit tests on CPUs with the following steps:"
msgstr "可以按照以下步骤在 CPU 上运行单元测试:"
#: ../../developer_guide/contribution/testing.md:18
msgid "You can run the unit tests on CPU with the following steps:"
msgstr "可以按照以下步骤在 CPU 上运行单元测试:"
#: ../../source/developer_guide/contribution/testing.md
#: ../../developer_guide/contribution/testing.md
msgid "Single card"
msgstr "单"
msgstr "单张卡片"
#: ../../source/developer_guide/contribution/testing.md:96
#: ../../source/developer_guide/contribution/testing.md:135
msgid "After starting the container, you should install the required packages:"
msgstr "启动容器后,您应该安装所需的软件包:"
#: ../../developer_guide/contribution/testing.md:85
#: ../../developer_guide/contribution/testing.md:123
msgid ""
"After starting the container, you should install the required packages:"
msgstr "启动容器后,你应该安装所需的软件包:"
#: ../../source/developer_guide/contribution/testing.md
#: ../../developer_guide/contribution/testing.md
msgid "Multi cards"
msgstr "多卡"
#: ../../source/developer_guide/contribution/testing.md:149
#: ../../developer_guide/contribution/testing.md:137
msgid "Running tests"
msgstr "运行测试"
#: ../../source/developer_guide/contribution/testing.md:151
msgid "Unit tests"
#: ../../developer_guide/contribution/testing.md:139
msgid "Unit test"
msgstr "单元测试"
#: ../../source/developer_guide/contribution/testing.md:153
#: ../../developer_guide/contribution/testing.md:141
msgid "There are several principles to follow when writing unit tests:"
msgstr "编写单元测试时需要遵循以下几个原则:"
msgstr "编写单元测试时需要遵循几个原则:"
#: ../../source/developer_guide/contribution/testing.md:155
#: ../../developer_guide/contribution/testing.md:143
msgid ""
"The test file path should be consistent with the source file and start "
"with the `test_` prefix, such as: `vllm_ascend/worker/worker.py` --> "
"The test file path should be consistent with source file and start with "
"`test_` prefix, such as: `vllm_ascend/worker/worker.py` --> "
"`tests/ut/worker/test_worker.py`"
msgstr ""
"测试文件路径应与源文件保持一致,并以 `test_` 前缀开头,例如:`vllm_ascend/worker/worker.py` --> "
"测试文件路径应与源文件保持一致,并以 `test_` 前缀开头,例如:`vllm_ascend/worker/worker.py` --> "
"`tests/ut/worker/test_worker.py`"
#: ../../source/developer_guide/contribution/testing.md:156
#: ../../developer_guide/contribution/testing.md:144
msgid ""
"The vLLM Ascend test uses unittest framework. See [the Python unittest "
"documentation](https://docs.python.org/3/library/unittest.html#module-"
"unittest) to understand how to write unit tests."
"The vLLM Ascend test are using unittest framework, see "
"[here](https://docs.python.org/3/library/unittest.html#module-unittest) to "
"understand how to write unit tests."
msgstr ""
"vLLM Ascend 测试使用 unittest 框架。请参阅 [Python unittest "
"文档](https://docs.python.org/3/library/unittest.html#module-unittest) 以了解如何编写单元测试。"
"vLLM Ascend 测试使用 unittest "
"框架,参见[这里](https://docs.python.org/3/library/unittest.html#module-"
"unittest)了解如何编写单元测试。"
#: ../../source/developer_guide/contribution/testing.md:157
#: ../../developer_guide/contribution/testing.md:145
msgid ""
"All unit tests can be run on CPUs, so you must mock the device-related "
"functions on the host."
msgstr "所有单元测试都可以在 CPU 上运行,因此必须在主机上模拟与设备相关的函数。"
"All unit tests can be run on CPU, so you must mock the device-related "
"function to host."
msgstr "所有单元测试都可以在 CPU 上运行,因此必须与设备相关的函数模拟为 host。"
#: ../../source/developer_guide/contribution/testing.md:158
#: ../../developer_guide/contribution/testing.md:146
msgid ""
"Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project"
"/vllm-ascend/blob/main/tests/ut/test_ascend_config.py)."
"Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-"
"project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py)."
msgstr ""
"示例:[tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/tests/ut/test_ascend_config.py)。"
#: ../../source/developer_guide/contribution/testing.md:159
#: ../../developer_guide/contribution/testing.md:147
msgid "You can run the unit tests using `pytest`:"
msgstr "可以使用 `pytest` 运行单元测试:"
msgstr "可以使用 `pytest` 运行单元测试:"
#: ../../source/developer_guide/contribution/testing.md
msgid "Single-card"
msgstr "单卡"
#: ../../developer_guide/contribution/testing.md
msgid "Multi cards test"
msgstr "多卡测试"
#: ../../source/developer_guide/contribution/testing.md
msgid "Multi-card"
msgstr "多卡"
#: ../../source/developer_guide/contribution/testing.md:206
#: ../../developer_guide/contribution/testing.md:192
msgid "E2E test"
msgstr "端到端测试"
#: ../../source/developer_guide/contribution/testing.md:208
#: ../../developer_guide/contribution/testing.md:194
msgid ""
"Although vllm-ascend CI provides E2E tests on Ascend CI (for example, "
"[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), "
"[schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), "
"[pr_test_full.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can run them "
"locally."
"Although vllm-ascend CI provide [e2e test](https://github.com/vllm-"
"project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on "
"Ascend CI, you can run it locally."
msgstr ""
"虽然 vllm-ascend CI 在 Ascend CI 上提供了端到端测试(例如,[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml)、[schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)、[pr_test_full.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/pr_test_full.yaml)),但您也可以在本地运行它们。"
"虽然 vllm-ascend CI 在 Ascend CI 上提供了 [端到端测试](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/.github/workflows/vllm_ascend_test.yaml),你也可以在本地运行它。"
#: ../../source/developer_guide/contribution/testing.md:218
msgid "You can't run the E2E test on CPUs."
msgstr "无法在 CPU 上运行端到端测试。"
#: ../../developer_guide/contribution/testing.md:204
msgid "You can't run e2e test on CPU."
msgstr "无法在 CPU 上运行 e2e 测试。"
#: ../../source/developer_guide/contribution/testing.md:257
#: ../../developer_guide/contribution/testing.md:240
msgid ""
"This will reproduce the E2E test. See "
"This will reproduce e2e test: "
"[vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/vllm_ascend_test.yaml)."
msgstr ""
"这将复现端到端测试。请参阅 [vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-"
"这将复现端到端测试[vllm_ascend_test.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/vllm_ascend_test.yaml)。"
#: ../../source/developer_guide/contribution/testing.md:259
msgid ""
"For running nightly multi-node test cases locally, refer to the `Running "
"Locally` section in [Multi Node Test](./multi_node_test.md)."
msgstr "要在本地运行夜间多节点测试用例,请参阅 [多节点测试](./multi_node_test.md) 中的 `本地运行` 部分。"
#: ../../developer_guide/contribution/testing.md:242
msgid "E2E test example:"
msgstr "E2E 测试示例:"
#: ../../source/developer_guide/contribution/testing.md:261
msgid "E2E test example"
msgstr "端到端测试示例"
#: ../../source/developer_guide/contribution/testing.md:263
#: ../../developer_guide/contribution/testing.md:244
msgid ""
"Offline test example: "
"[`tests/e2e/singlecard/test_offline_inference.py`](https://github.com"
"/vllm-project/vllm-"
"[`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)"
msgstr ""
"离线测试示例:[`tests/e2e/singlecard/test_offline_inference.py`](https://github.com"
"/vllm-project/vllm-"
"离线测试示例:[`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)"
#: ../../source/developer_guide/contribution/testing.md:264
#: ../../developer_guide/contribution/testing.md:245
msgid ""
"Online test examples: "
"[`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)"
"[`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-"
"project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)"
msgstr ""
"在线测试示例:[`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)"
"在线测试示例:[`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-"
"project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)"
#: ../../source/developer_guide/contribution/testing.md:265
#: ../../developer_guide/contribution/testing.md:246
msgid ""
"Correctness test example: "
"[`tests/e2e/singlecard/test_aclgraph_accuracy.py`](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_aclgraph_accuracy.py)"
"[`tests/e2e/singlecard/test_aclgraph_accuracy.py`](https://github.com/vllm-"
"project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph_accuracy.py)"
msgstr ""
"正确性测试示例:[`tests/e2e/singlecard/test_aclgraph_accuracy.py`](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/tests/e2e/singlecard/test_aclgraph_accuracy.py)"
"正确性测试示例:[`tests/e2e/singlecard/test_aclgraph_accuracy.py`](https://github.com/vllm-"
"project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph_accuracy.py)"
#: ../../source/developer_guide/contribution/testing.md:267
#: ../../developer_guide/contribution/testing.md:247
msgid ""
"The CI resource is limited, and you might need to reduce the number of "
"layers of a model. Below is an example of how to generate a reduced layer"
" model:"
msgstr "CI 资源有限,您可能需要减少模型的层数。以下是如何生成缩减层数模型的示例:"
"Reduced Layer model test example: [test_torchair_graph_mode.py - "
"DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-"
"ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)"
msgstr ""
"简化层模型测试示例:[test_torchair_graph_mode.py - "
"DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-"
"ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)"
#: ../../source/developer_guide/contribution/testing.md:268
#: ../../developer_guide/contribution/testing.md:249
msgid ""
"Fork the original model repo in modelscope. All the files in the repo "
"except for weights are required."
msgstr "在 ModelScope 中 Fork 原始模型仓库。需要仓库中除权重文件外的所有文件。"
msgid ""
"The CI resource is limited, you might need to reduce layer number of the "
"model, below is an example of how to generate a reduced layer model:"
msgstr "CI 资源有限,您可能需要减少模型的层数,下面是一个生成减少层数模型的示例:"
#: ../../source/developer_guide/contribution/testing.md:269
#: ../../developer_guide/contribution/testing.md:250
msgid ""
"Fork the original model repo in modelscope, we need all the files in the "
"repo except for weights."
msgstr "在 modelscope 中 fork 原始模型仓库,我们需要仓库中的所有文件,除了权重文件。"
#: ../../developer_guide/contribution/testing.md:251
#, python-brace-format
msgid ""
"Set `num_hidden_layers` to the expected number of layers, e.g., "
"`{\"num_hidden_layers\": 2,}`"
msgstr "将 `num_hidden_layers` 设置为期望的层数,例如 `{\"num_hidden_layers\": 2,}`"
#: ../../source/developer_guide/contribution/testing.md:270
#: ../../developer_guide/contribution/testing.md:252
msgid ""
"Copy the following python script as `generate_random_weight.py`. Set the "
"relevant parameters `MODEL_LOCAL_PATH`, `DIST_DTYPE` and "
"`DIST_MODEL_PATH` as needed:"
msgstr ""
"将以下 Python 脚本复制为 `generate_random_weight.py`。根据需要设置相关参数 "
"`MODEL_LOCAL_PATH`、`DIST_DTYPE` 和 `DIST_MODEL_PATH`"
#: ../../source/developer_guide/contribution/testing.md:288
msgid "View CI log summary in GitHub Actions"
msgstr "在 GitHub Actions 中查看 CI 日志摘要"
#: ../../source/developer_guide/contribution/testing.md:290
msgid ""
"After a CI job finishes, you can open the corresponding GitHub Actions "
"job page and check the `Summary` tab to view the generated CI log "
"summary."
msgstr "CI 作业完成后,您可以打开相应的 GitHub Actions 作业页面,并查看 `Summary` 选项卡以查看生成的 CI 日志摘要。"
#: ../../source/developer_guide/contribution/testing.md:293
msgid "![GitHub Actions CI log summary](../../assets/ci_log_summary.png)"
msgstr "![GitHub Actions CI 日志摘要](../../assets/ci_log_summary.png)"
#: ../../source/developer_guide/contribution/testing.md:293
msgid "GitHub Actions CI log summary"
msgstr "GitHub Actions CI 日志摘要"
#: ../../source/developer_guide/contribution/testing.md:295
msgid ""
"The summary is intended to help developers triage failures more quickly. "
"It may include:"
msgstr "该摘要旨在帮助开发者更快地排查故障。它可能包括:"
#: ../../source/developer_guide/contribution/testing.md:297
msgid "failed test files"
msgstr "失败的测试文件"
#: ../../source/developer_guide/contribution/testing.md:298
msgid "failed test cases"
msgstr "失败的测试用例"
#: ../../source/developer_guide/contribution/testing.md:299
msgid "distinct root-cause errors"
msgstr "不同的根本原因错误"
#: ../../source/developer_guide/contribution/testing.md:300
msgid "short error context extracted from the job log"
msgstr "从作业日志中提取的简短错误上下文"
#: ../../source/developer_guide/contribution/testing.md:302
msgid ""
"This summary is generated from the job log by "
"`/.github/workflows/scripts/ci_log_summary_v2.py` for unit-test and e2e "
"workflows."
msgstr "该摘要是由 `/.github/workflows/scripts/ci_log_summary_v2.py` 从作业日志中为单元测试和端到端测试工作流生成的。"
#: ../../source/developer_guide/contribution/testing.md:305
#: ../../developer_guide/contribution/testing.md:270
msgid "Run doctest"
msgstr "运行 doctest"
#: ../../source/developer_guide/contribution/testing.md:307
#: ../../developer_guide/contribution/testing.md:272
msgid ""
"vllm-ascend provides a `vllm-ascend/tests/e2e/run_doctests.sh` command to"
" run all doctests in the doc files. The doctest is a good way to make "
"sure docs stay current and examples remain executable, which can be run "
"locally as follows:"
msgstr ""
"vllm-ascend 提供了一个 `vllm-ascend/tests/e2e/run_doctests.sh` 命令,用于运行文档文件中的所有 "
"doctest。doctest 是确保文档保持最新且示例可执行的好方法,可以按照以下方式在本地运行:"
#: ../../source/developer_guide/contribution/testing.md:315
#: ../../developer_guide/contribution/testing.md:280
msgid ""
"This will reproduce the same environment as the CI. See "
"[vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml)."
msgstr ""
"这将复现与 CI 相同的环境。请参阅 [vllm_ascend_doctest.yaml](https://github.com/vllm-project/vllm-"
"ascend/blob/main/.github/workflows/vllm_ascend_doctest.yaml)。"


@@ -1,252 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:1
msgid "Using AISBench"
msgstr "使用 AISBench"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:3
msgid ""
"This document guides you to conduct accuracy testing using "
"[AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench "
"provides accuracy and performance evaluation for many datasets."
msgstr ""
"本文档指导您如何使用 [AISBench](https://gitee.com/aisbench/benchmark/tree/master) "
"进行精度测试。AISBench 为许多数据集提供了精度和性能评估。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:5
msgid "Online Server"
msgstr "在线服务器"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:7
msgid "1. Start the vLLM server"
msgstr "1.启动 vLLM 服务器"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:9
msgid "You can run docker container to start the vLLM server on a single NPU:"
msgstr "您可以运行 docker 容器在单个 NPU 上启动 vLLM 服务器:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:37
msgid "Run the vLLM server in the docker."
msgstr "在 docker 中运行 vLLM 服务器。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:45
msgid ""
"`--max_model_len` should be greater than `35000`, this will be suitable "
"for most datasets. Otherwise the accuracy evaluation may be affected."
msgstr "`--max_model_len` 应大于 `35000`,这适用于大多数数据集。否则可能会影响精度评估。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:48
msgid "The vLLM server is started successfully, if you see logs as below:"
msgstr "如果看到如下日志,则 vLLM 服务器启动成功:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:56
msgid "2. Run different datasets using AISBench"
msgstr "2.使用 AISBench 运行不同数据集"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:58
msgid "Install AISBench"
msgstr "安装 AISBench"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:60
msgid ""
"Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for"
" details. Install AISBench from source."
msgstr ""
"详情请参考 [AISBench](https://gitee.com/aisbench/benchmark/tree/master)。从源码安装 "
"AISBench。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:69
msgid "Install extra AISBench dependencies."
msgstr "安装额外的 AISBench 依赖项。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:76
msgid "Run `ais_bench -h` to check the installation."
msgstr "运行 `ais_bench -h` 以检查安装。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:78
msgid "Download Dataset"
msgstr "下载数据集"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:80
msgid "You can choose one or multiple datasets to execute accuracy evaluation."
msgstr "您可以选择一个或多个数据集来执行精度评估。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:82
msgid "`C-Eval` dataset."
msgstr "`C-Eval` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:84
msgid ""
"Take `C-Eval` dataset as an example. You can refer to "
"[Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets)"
" for more datasets. Each dataset has a `README.md` with detailed download"
" and installation instructions."
msgstr ""
"以 `C-Eval` 数据集为例。更多数据集请参考 "
"[Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets)。每个数据集都有一个"
" `README.md` 文件,包含详细的下载和安装说明。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:86
msgid "Download dataset and install it to specific path."
msgstr "下载数据集并安装到指定路径。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:98
msgid "`MMLU` dataset."
msgstr "`MMLU` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:107
msgid "`GPQA` dataset."
msgstr "`GPQA` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:116
msgid "`MATH` dataset."
msgstr "`MATH` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:125
msgid "`LiveCodeBench` dataset."
msgstr "`LiveCodeBench` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:133
msgid "`AIME 2024` dataset."
msgstr "`AIME 2024` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:144
msgid "`GSM8K` dataset."
msgstr "`GSM8K` 数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:153
msgid "Configuration"
msgstr "配置"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:155
msgid ""
"Update the file "
"`benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`."
" There are several arguments that you should update according to your "
"environment."
msgstr ""
"更新文件 "
"`benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`。有几个参数需要根据您的环境进行更新。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:158
msgid ""
"`attr`: Identifier for the inference backend type, fixed as `service` "
"(serving-based inference) or `local` (local model)."
msgstr "`attr`:推理后端类型的标识符,固定为 `service`(基于服务的推理)或 `local`(本地模型)。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:159
msgid "`type`: Used to select different backend API types."
msgstr "`type`:用于选择不同的后端 API 类型。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:160
msgid ""
"`abbr`: Unique identifier for a local task, used to distinguish between "
"multiple tasks."
msgstr "`abbr`:本地任务的唯一标识符,用于区分多个任务。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:161
msgid "`path`: Update to your model weight path."
msgstr "`path`:更新为您的模型权重路径。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:162
msgid "`model`: Update to your model name in vLLM."
msgstr "`model`:更新为您的 vLLM 中的模型名称。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:163
msgid "`host_ip` and `host_port`: Update to your vLLM server ip and port."
msgstr "`host_ip` 和 `host_port`:更新为您的 vLLM 服务器的 IP 和端口。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:164
msgid ""
"`max_out_len`: Note `max_out_len` + LLM input length should be less than "
"`max_model_len`(config in your vllm server), `32768` will be suitable for"
" most datasets."
msgstr ""
"`max_out_len`:注意 `max_out_len` + LLM 输入长度应小于 `max_model_len`(在您的 vllm "
"服务器中配置),`32768` 适用于大多数数据集。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:165
msgid "`batch_size`: Update according to your dataset."
msgstr "`batch_size`:根据您的数据集进行更新。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:166
msgid "`temperature`: Update inference argument."
msgstr "`temperature`:更新推理参数。"
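Putting the fields above together, a `vllm_api_general_chat.py` entry could look roughly like the sketch below; the import path and every value are assumptions to adapt to your own environment, not the shipped defaults.

```python
# Hedged sketch of the AISBench model config described above;
# the class import path and all values are placeholders.
from ais_bench.benchmark.models import VLLMCustomAPIChat  # assumed path

models = [
    dict(
        attr="service",                  # serving-based inference backend
        type=VLLMCustomAPIChat,          # backend API type
        abbr="vllm-api-general-chat",    # unique local task identifier
        path="/path/to/model/weights",   # model weight path
        model="qwen3-32b",               # model name registered in vLLM
        host_ip="localhost",
        host_port=8000,
        max_out_len=32768,               # input + output < max_model_len
        batch_size=128,
        generation_kwargs=dict(temperature=0.6),
    )
]
```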
#: ../../source/developer_guide/evaluation/using_ais_bench.md:199
msgid "Execute Accuracy Evaluation"
msgstr "执行精度评估"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:201
msgid "Run the following code to execute different accuracy evaluation."
msgstr "运行以下代码以执行不同的精度评估。"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:224
msgid ""
"After each dataset execution, you can get the result from saved files "
"such as `outputs/default/20250628_151326`, there is an example as "
"follows:"
msgstr "每个数据集执行后,您可以从保存的文件(例如 `outputs/default/20250628_151326`)中获取结果,示例如下:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:249
msgid "Execute Performance Evaluation"
msgstr "执行性能评估"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:251
msgid "Text-only benchmarks:"
msgstr "纯文本基准测试:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:273
msgid "Multi-modal benchmarks (text + images):"
msgstr "多模态基准测试(文本 + 图像):"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:280
msgid ""
"After execution, you can get the result from saved files, there is an "
"example as follows:"
msgstr "执行后,您可以从保存的文件中获取结果,示例如下:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:300
msgid "3. Troubleshooting"
msgstr "3.故障排除"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:302
msgid "Invalid Image Path Error"
msgstr "无效图像路径错误"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:304
msgid "If you download the TextVQA dataset following the AISBench documentation:"
msgstr "如果您按照 AISBench 文档下载 TextVQA 数据集:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:316
msgid "you may encounter the following error:"
msgstr "您可能会遇到以下错误:"
#: ../../source/developer_guide/evaluation/using_ais_bench.md:322
msgid ""
"You need to manually replace the dataset image paths with absolute paths,"
" changing `/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` "
"to the actual absolute directory where the images are stored:"
msgstr ""
"您需要手动将数据集图像路径替换为绝对路径,将 "
"`/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` "
"更改为图像存储的实际绝对目录:"


@@ -1,110 +1,112 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/evaluation/using_evalscope.md:1
#: ../../developer_guide/evaluation/using_evalscope.md:1
msgid "Using EvalScope"
msgstr "使用 EvalScope"
#: ../../source/developer_guide/evaluation/using_evalscope.md:3
#: ../../developer_guide/evaluation/using_evalscope.md:3
msgid ""
"This document will guide you through model inference stress testing and "
"accuracy testing using "
"[EvalScope](https://github.com/modelscope/evalscope)."
msgstr ""
"本文档将指导您如何使用 [EvalScope](https://github.com/modelscope/evalscope) "
"进行模型推理压力测试和精度测试。"
#: ../../source/developer_guide/evaluation/using_evalscope.md:5
msgid "1. Online server"
msgstr "1.在线服务"
#: ../../developer_guide/evaluation/using_evalscope.md:5
msgid "1. Online serving"
msgstr "1. 在线服务"
#: ../../source/developer_guide/evaluation/using_evalscope.md:7
#: ../../developer_guide/evaluation/using_evalscope.md:7
msgid "You can run docker container to start the vLLM server on a single NPU:"
msgstr "你可以运行 Docker 容器,在单个 NPU 上启动 vLLM 服务器:"
#: ../../source/developer_guide/evaluation/using_evalscope.md:35
msgid ""
"If the vLLM server is started successfully, you can see information shown"
" below:"
msgstr "如果 vLLM 服务器启动成功,你将看到如下所示的信息:"
#: ../../developer_guide/evaluation/using_evalscope.md:34
msgid "If your service start successfully, you can see the info shown below:"
msgstr "如果你的服务启动成功,你会看到如下所示的信息:"
#: ../../source/developer_guide/evaluation/using_evalscope.md:43
msgid ""
"Once your server is started, you can query the model with input prompts "
"in a new terminal:"
msgstr "服务器启动后,你可以在新的终端中使用输入提示词查询模型:"
#: ../../developer_guide/evaluation/using_evalscope.md:42
msgid ""
"Once your server is started, you can query the model with input prompts in "
"new terminal:"
msgstr "一旦你的服务器启动后,你可以在新的终端中用输入提示词查询模型:"
#: ../../source/developer_guide/evaluation/using_evalscope.md:56
#: ../../developer_guide/evaluation/using_evalscope.md:55
msgid "2. Install EvalScope using pip"
msgstr "2.使用 pip 安装 EvalScope"
#: ../../source/developer_guide/evaluation/using_evalscope.md:58
msgid "You can install EvalScope as follows:"
msgstr "你可以通过以下方式安装 EvalScope"
#: ../../developer_guide/evaluation/using_evalscope.md:57
msgid "You can install EvalScope by using:"
msgstr "你可以使用以下方式安装 EvalScope"
#: ../../source/developer_guide/evaluation/using_evalscope.md:66
msgid "3. Run GSM8K using EvalScope for accuracy testing"
msgstr "3.使用 EvalScope 运行 GSM8K 进行精度测试"
#: ../../developer_guide/evaluation/using_evalscope.md:65
msgid "3. Run gsm8k accuracy test using EvalScope"
msgstr "3. 使用 EvalScope 运行 gsm8k 准确率测试"
#: ../../source/developer_guide/evaluation/using_evalscope.md:68
msgid ""
"You can use `evalscope eval` to run GSM8K (a grade-school math benchmark "
"dataset) for accuracy testing:"
msgstr "你可以使用 `evalscope eval` 运行 GSM8K一个小学数学基准数据集进行精度测试"
#: ../../developer_guide/evaluation/using_evalscope.md:67
msgid "You can `evalscope eval` run gsm8k accuracy test:"
msgstr "你可以使用 `evalscope eval` 运行 gsm8k 准确率测试:"
#: ../../source/developer_guide/evaluation/using_evalscope.md:80
#: ../../source/developer_guide/evaluation/using_evalscope.md:117
msgid "After 1 to 2 minutes, the output is shown below:"
msgstr "1 到 2 分钟后,输出结果如下所示:"
#: ../../developer_guide/evaluation/using_evalscope.md:78
#: ../../developer_guide/evaluation/using_evalscope.md:114
msgid "After 1-2 mins, the output is as shown below:"
msgstr "1-2 分钟后,输出如下所示:"
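Besides the CLI, the same GSM8K run can be expressed through EvalScope's Python API; a hedged sketch follows, in which the endpoint and model name are placeholders for your own deployment.

```python
# Hedged sketch: GSM8K accuracy run against a vLLM OpenAI-compatible
# endpoint via EvalScope's Python API; URL and model are placeholders.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model="Qwen/Qwen2.5-7B-Instruct",
    api_url="http://localhost:8000/v1/chat/completions",
    eval_type="service",
    datasets=["gsm8k"],
    limit=10,  # a few samples for a quick smoke test
)
run_task(task_cfg=task)
```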
#: ../../source/developer_guide/evaluation/using_evalscope.md:90
msgid ""
"See more details in [EvalScope doc - Model API Service "
"Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html"
"#model-api-service-evaluation)."
msgstr ""
"更多详情请参阅 [EvalScope 文档 - 模型 API "
"服务评估](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html"
"#model-api-service-evaluation)。"
#: ../../developer_guide/evaluation/using_evalscope.md:88
msgid ""
"See more detail in: [EvalScope doc - Model API Service "
"Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-"
"api-service-evaluation)."
msgstr ""
"更多详情请见:[EvalScope 文档 - 模型 API "
"服务评估](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-"
"api-service-evaluation)。"
#: ../../source/developer_guide/evaluation/using_evalscope.md:92
#: ../../developer_guide/evaluation/using_evalscope.md:90
msgid "4. Run model inference stress testing using EvalScope"
msgstr "4.使用 EvalScope 运行模型推理压力测试"
#: ../../source/developer_guide/evaluation/using_evalscope.md:94
#: ../../developer_guide/evaluation/using_evalscope.md:92
msgid "Install EvalScope[perf] using pip"
msgstr "使用 pip 安装 EvalScope[perf]"
#: ../../source/developer_guide/evaluation/using_evalscope.md:100
#: ../../developer_guide/evaluation/using_evalscope.md:98
msgid "Basic usage"
msgstr "基本用法"
#: ../../source/developer_guide/evaluation/using_evalscope.md:102
msgid "You can use `evalscope perf` to run perf testing:"
msgstr "你可以使用 `evalscope perf` 运行性能测试:"
#: ../../developer_guide/evaluation/using_evalscope.md:100
msgid "You can `evalscope perf` run perf test:"
msgstr "你可以使用 `evalscope perf` 运行性能测试:"
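The `evalscope perf` CLI also has a Python entry point; below is a hedged sketch of an equivalent stress test, where the endpoint, model, and request counts are all placeholders.

```python
# Hedged sketch of a stress test via EvalScope's perf API;
# every argument value below is illustrative.
from evalscope.perf.arguments import Arguments
from evalscope.perf.main import run_perf_benchmark

args = Arguments(
    url="http://localhost:8000/v1/chat/completions",
    model="Qwen/Qwen2.5-7B-Instruct",
    api="openai",          # OpenAI-compatible endpoint
    dataset="openqa",
    number=20,             # total requests
    parallel=4,            # concurrent requests
)
run_perf_benchmark(args)
```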
#: ../../source/developer_guide/evaluation/using_evalscope.md:115
#: ../../developer_guide/evaluation/using_evalscope.md:112
msgid "Output results"
msgstr "输出结果"
#: ../../source/developer_guide/evaluation/using_evalscope.md:176
msgid ""
"See more detail in [EvalScope doc - Model Inference Stress "
"Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html"
"#basic-usage)."
msgstr ""
"更多详情请参阅 [EvalScope 文档 - "
"模型推理压力测试](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html"
"#basic-usage)。"
#: ../../developer_guide/evaluation/using_evalscope.md:173
msgid ""
"See more detail in: [EvalScope doc - Model Inference Stress "
"Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-"
"usage)."
msgstr ""
"更多详情见:[EvalScope 文档 - "
"模型推理压力测试](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-"
"usage)。"


@@ -4,124 +4,62 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:1
#: ../../developer_guide/evaluation/using_lm_eval.md:1
msgid "Using lm-eval"
msgstr "使用 lm-eval"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:3
msgid "This document guides you to conduct accuracy testing using [lm-eval][1]."
msgstr "本文档指导您如何使用 [lm-eval][1] 进行准确率测试。"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:5
msgid "Online Server"
msgstr "在线服务器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:7
msgid "1. Start the vLLM server"
msgstr "1.启动 vLLM 服务器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:9
msgid "You can run docker container to start the vLLM server on a single NPU:"
msgstr "您可以在单个 NPU 上运行 Docker 容器来启动 vLLM 服务器:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:38
msgid "The vLLM server is started successfully, if you see logs as below:"
msgstr "如果您看到如下日志,则表示 vLLM 服务器已成功启动:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:46
msgid ""
"2. Run GSM8K using the vLLM server (curl) and then run lm-eval for "
"accuracy testing"
msgstr "2.使用 vLLM 服务器curl运行 GSM8K然后运行 lm-eval 进行准确率测试"
#: ../../developer_guide/evaluation/using_lm_eval.md:2
msgid ""
"This document will guide you have a accuracy testing using [lm-"
"eval](https://github.com/EleutherAI/lm-evaluation-harness)."
msgstr ""
"本文将指导你如何使用 [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) "
"进行准确率测试。"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:48
msgid "You can query the result with input prompts:"
msgstr "您可以使用输入提示词查询结果:"
#: ../../developer_guide/evaluation/using_lm_eval.md:4
msgid "1. Run docker container"
msgstr "1. 运行 docker 容器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:75
msgid "The output format matches the following:"
msgstr "输出格式符合以下形式"
#: ../../developer_guide/evaluation/using_lm_eval.md:6
msgid "You can run docker container on a single NPU:"
msgstr "你可以在单个NPU上运行docker容器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:105
#: ../../source/developer_guide/evaluation/using_lm_eval.md:177
msgid "Install lm-eval in the container:"
msgstr "在容器中安装 lm-eval"
#: ../../developer_guide/evaluation/using_lm_eval.md:33
msgid "2. Run ceval accuracy test using lm-eval"
msgstr "2. 使用 lm-eval 运行 ceval 准确性测试"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:114
#: ../../source/developer_guide/evaluation/using_lm_eval.md:186
msgid ""
"The Docker container is launched with `VLLM_USE_MODELSCOPE=True`, which "
"may cause lm-eval to download datasets from ModelScope instead of "
"HuggingFace. Setting `USE_MODELSCOPE_HUB=0` disables this behavior so "
"that lm-eval can fetch datasets from HuggingFace correctly."
msgstr ""
"Docker 容器以 `VLLM_USE_MODELSCOPE=True` 启动,这可能导致 lm-eval 从 ModelScope 而非 "
"HuggingFace 下载数据集。设置 `USE_MODELSCOPE_HUB=0` 可禁用此行为,使 lm-eval 能够正确从 "
"HuggingFace 获取数据集。"
#: ../../developer_guide/evaluation/using_lm_eval.md:34
msgid "Install lm-eval in the container."
msgstr "在容器中安装 lm-eval。"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:120
#: ../../source/developer_guide/evaluation/using_lm_eval.md:192
#: ../../developer_guide/evaluation/using_lm_eval.md:39
msgid "Run the following command:"
msgstr "运行以下命令:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:131
msgid "After 30 minutes, the output is as shown below:"
msgstr "30 分钟后,输出如下所示:"
#: ../../developer_guide/evaluation/using_lm_eval.md:50
msgid "After 1-2 mins, the output is as shown below:"
msgstr "1-2 分钟后,输出如下所示:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:143
msgid "Offline Server"
msgstr "离线服务器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:145
msgid "1. Run docker container"
msgstr "1.运行 docker 容器"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:147
msgid "You can run docker container on a single NPU:"
msgstr "您可以在单个 NPU 上运行 docker 容器:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:175
msgid "2. Run GSM8K using lm-eval for accuracy testing"
msgstr "2.使用 lm-eval 运行 GSM8K 进行准确率测试"
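For reference, the same GSM8K evaluation can be driven from Python with lm-eval's `simple_evaluate`; a hedged sketch follows, with the model path and sample limit as placeholders.

```python
# Hedged sketch: offline GSM8K run through lm-eval's vLLM backend;
# the model path and limit are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096",
    tasks=["gsm8k"],
    limit=10,  # small sample for a smoke test
)
print(results["results"]["gsm8k"])
```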
#: ../../source/developer_guide/evaluation/using_lm_eval.md:203
msgid "After 1 to 2 minutes, the output is shown below:"
msgstr "1 到 2 分钟后,输出如下所示:"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:215
msgid "Use Offline Datasets"
msgstr "使用离线数据集"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:217
#: ../../developer_guide/evaluation/using_lm_eval.md:62
msgid ""
"Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples,"
" and you can see more from [using-local-datasets][2]."
msgstr "以 GSM8K单数据集和 MMLU多学科数据集为例您可以在 [using-local-datasets][2] 中查看更多信息。"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:231
msgid "Set [gsm8k.yaml][3] as follows:"
msgstr "按如下方式设置 [gsm8k.yaml][3]"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:294
msgid "Set [_default_template_yaml][4] as follows:"
msgstr "按如下方式设置 [_default_template_yaml][4]"
#: ../../source/developer_guide/evaluation/using_lm_eval.md:317
msgid "You can see more usage on [Lm-eval Docs][5]."
msgstr "您可以在 [Lm-eval 文档][5] 中查看更多用法。"
"You can see more usage on [Lm-eval Docs](https://github.com/EleutherAI/lm-"
"evaluation-harness/blob/main/docs/README.md)."
msgstr ""
"你可以在 [Lm-eval 文档](https://github.com/EleutherAI/lm-evaluation-"
"harness/blob/main/docs/README.md) 上查看更多用法。"


@@ -4,81 +4,80 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/evaluation/using_opencompass.md:1
#: ../../developer_guide/evaluation/using_opencompass.md:1
msgid "Using OpenCompass"
msgstr "使用 OpenCompass"
#: ../../source/developer_guide/evaluation/using_opencompass.md:3
msgid ""
"This document guides you to conduct accuracy testing using "
"[OpenCompass](https://github.com/open-compass/opencompass)."
msgstr ""
"本文档将指导你如何使用 [OpenCompass](https://github.com/open-compass/opencompass) "
"进行准确率测试。"
#: ../../developer_guide/evaluation/using_opencompass.md:2
msgid ""
"This document will guide you have a accuracy testing using "
"[OpenCompass](https://github.com/open-compass/opencompass)."
msgstr ""
"本文档将指导你如何使用 [OpenCompass](https://github.com/open-compass/opencompass) "
"进行准确率测试。"
#: ../../source/developer_guide/evaluation/using_opencompass.md:5
msgid "1. Online Server"
msgstr "1.在线服务"
#: ../../developer_guide/evaluation/using_opencompass.md:4
msgid "1. Online Serving"
msgstr "1. 在线服务"
#: ../../source/developer_guide/evaluation/using_opencompass.md:7
msgid "You can run a docker container to start the vLLM server on a single NPU:"
msgstr "你可以运行一个 Docker 容器,在单个 NPU 上启动 vLLM 服务器:"
#: ../../developer_guide/evaluation/using_opencompass.md:6
msgid "You can run docker container to start the vLLM server on a single NPU:"
msgstr "你可以运行 docker 容器,在单个 NPU 上启动 vLLM 服务器:"
#: ../../source/developer_guide/evaluation/using_opencompass.md:35
msgid "The vLLM server is started successfully, if you see information as below:"
msgstr "如果看到如下信息,则表明 vLLM 服务器已成功启动"
#: ../../developer_guide/evaluation/using_opencompass.md:32
msgid "If your service start successfully, you can see the info shown below:"
msgstr "如果你的服务启动成功,你会看到如下所示的信息"
#: ../../source/developer_guide/evaluation/using_opencompass.md:43
msgid ""
"Once your server is started, you can query the model with input prompts "
"in a new terminal."
msgstr "服务器启动后,你可以在新的终端中使用输入提示词查询模型。"
#: ../../developer_guide/evaluation/using_opencompass.md:39
msgid ""
"Once your server is started, you can query the model with input prompts in "
"new terminal:"
msgstr "一旦你的服务器启动后,你可以在新的终端中用输入提示词查询模型:"
#: ../../source/developer_guide/evaluation/using_opencompass.md:56
msgid ""
"2. Run C-Eval (a Chinese language model evaluation benchmark) using "
"OpenCompass for accuracy testing"
msgstr "2.使用 OpenCompass 运行 C-Eval 进行准确率测试"
#: ../../developer_guide/evaluation/using_opencompass.md:51
msgid "2. Run ceval accuracy test using OpenCompass"
msgstr "2. 使用 OpenCompass 运行 ceval 准确率测试"
#: ../../source/developer_guide/evaluation/using_opencompass.md:58
msgid ""
"Install OpenCompass and configure the environment variables in the "
"container:"
msgstr "在容器中安装 OpenCompass 并配置环境变量:"
#: ../../developer_guide/evaluation/using_opencompass.md:52
msgid ""
"Install OpenCompass and configure the environment variables in the "
"container."
msgstr "在容器中安装 OpenCompass 并配置环境变量。"
#: ../../source/developer_guide/evaluation/using_opencompass.md:70
msgid ""
"Add the following content to "
"`opencompass/configs/eval_vllm_ascend_demo.py`:"
msgstr "将以下内容添加到 `opencompass/configs/eval_vllm_ascend_demo.py` 文件中:"
#: ../../developer_guide/evaluation/using_opencompass.md:64
msgid ""
"Add `opencompass/configs/eval_vllm_ascend_demo.py` with the following "
"content:"
msgstr "添加 `opencompass/configs/eval_vllm_ascend_demo.py`,内容如下:"
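A hedged sketch of what such a demo config might contain, assuming an OpenAI-compatible vLLM endpoint; the dataset import path, model name, and endpoint are illustrative, not the shipped demo config.

```python
# Hedged sketch of eval_vllm_ascend_demo.py; all names are illustrative.
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets

datasets = ceval_datasets
models = [
    dict(
        type=OpenAISDK,
        abbr="vllm-ascend-api",
        path="Qwen/Qwen2.5-7B-Instruct",            # served model name
        openai_api_base="http://localhost:8000/v1",
        key="EMPTY",                                # vLLM ignores the key
        max_out_len=1024,
        batch_size=8,
    )
]
```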
#: ../../source/developer_guide/evaluation/using_opencompass.md:110
#: ../../developer_guide/evaluation/using_opencompass.md:104
msgid "Run the following command:"
msgstr "运行以下命令:"
#: ../../source/developer_guide/evaluation/using_opencompass.md:116
msgid "After 1 to 2 minutes, the output is shown below:"
msgstr "1 到 2 分钟后,输出结果如下所示:"
#: ../../developer_guide/evaluation/using_opencompass.md:110
msgid "After 1-2 mins, the output is as shown below:"
msgstr "1-2 分钟后,输出如下所示:"
#: ../../source/developer_guide/evaluation/using_opencompass.md:126
#: ../../developer_guide/evaluation/using_opencompass.md:120
msgid ""
"You can see more usage on [OpenCompass "
"Docs](https://opencompass.readthedocs.io/en/latest/index.html)."
msgstr ""
"你可以在 [OpenCompass "
"文档](https://opencompass.readthedocs.io/en/latest/index.html) 查看更多用法。"


@@ -20,12 +20,12 @@ msgstr ""
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../developer_guide/Design_Documents/index.md:1
#: ../../developer_guide/Design_Documents/index.md:5
msgid "Design Documents"
msgstr "设计文档"
#: ../../developer_guide/feature_guide/index.md:1
#: ../../developer_guide/feature_guide/index.md:5
msgid "Feature Guide"
msgstr "功能指南"
#: ../../developer_guide/Design_Documents/index.md:3
#: ../../developer_guide/feature_guide/index.md:3
msgid ""
"This section provides an overview of the features implemented in vLLM "
"Ascend. Developers can refer to this guide to understand how vLLM Ascend "


@@ -0,0 +1,248 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../developer_guide/feature_guide/patch.md:1
msgid "Patch in vLLM Ascend"
msgstr "在 vLLM Ascend 中的补丁"
#: ../../developer_guide/feature_guide/patch.md:3
msgid ""
"vLLM Ascend is a platform plugin for vLLM. Due to the release cycle of vLLM "
"and vLLM Ascend is different, and the hardware limitation in some case, we "
"need to patch some code in vLLM to make it compatible with vLLM Ascend."
msgstr ""
"vLLM Ascend 是 vLLM 的一个平台插件。由于 vLLM 和 vLLM Ascend "
"的发布周期不同,并且在某些情况下存在硬件限制,我们需要对 vLLM 进行一些代码补丁,以使其能够兼容 vLLM Ascend。"
#: ../../developer_guide/feature_guide/patch.md:5
msgid ""
"In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to "
"address the change for vLLM."
msgstr "在 vLLM Ascend 代码中,我们提供了一个补丁模块 `vllm_ascend/patch` 用于应对 vLLM 的变更。"
#: ../../developer_guide/feature_guide/patch.md:7
msgid "Principle"
msgstr "原理"
#: ../../developer_guide/feature_guide/patch.md:9
msgid ""
"We should keep in mind that Patch is not the best way to make vLLM Ascend "
"compatible. It's just a temporary solution. The best way is to contribute "
"the change to vLLM to make it compatible with vLLM Ascend originally. In "
"vLLM Ascend, we have the basic principle for Patch strategy:"
msgstr ""
"我们需要记住Patch 不是让 vLLM 兼容 Ascend 的最佳方式,这只是一个临时的解决方案。最好的方法是将修改贡献到 vLLM 项目中,从而让"
" vLLM 原生支持 Ascend。对于 vLLM Ascend我们对 Patch 策略有一个基本原则:"
#: ../../developer_guide/feature_guide/patch.md:11
msgid "Less is more. Please do not patch unless it's the only way currently."
msgstr "少即是多。请不要打补丁,除非这是目前唯一的方法。"
#: ../../developer_guide/feature_guide/patch.md:12
msgid ""
"Once a patch is added, it's required to describe the future plan for "
"removing the patch."
msgstr "一旦补丁被添加,必须说明将来移除该补丁的计划。"
#: ../../developer_guide/feature_guide/patch.md:13
msgid "Anytime, clean the patch code is welcome."
msgstr "任何时候,欢迎清理补丁代码。"
#: ../../developer_guide/feature_guide/patch.md:15
msgid "How it works"
msgstr "工作原理"
#: ../../developer_guide/feature_guide/patch.md:17
msgid "In `vllm_ascend/patch`, you can see the code structure as follows:"
msgstr "在 `vllm_ascend/patch` 目录中,你可以看到如下代码结构:"
#: ../../developer_guide/feature_guide/patch.md:33
msgid ""
"**platform**: The patch code in this directory is for patching the code in "
"vLLM main process. It's called by "
"`vllm_ascend/platform::NPUPlatform::pre_register_and_update` very early when"
" vLLM is initialized."
msgstr ""
"**platform**:此目录下的补丁代码用于修补 vLLM 主进程中的代码。当 vLLM 初始化时,会在很早的阶段由 "
"`vllm_ascend/platform::NPUPlatform::pre_register_and_update` 调用。"
#: ../../developer_guide/feature_guide/patch.md:34
msgid ""
"For online mode, vLLM process calls the platform patch here "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing "
"the cli args."
msgstr ""
"对于在线模式vLLM 进程在解析命令行参数时,会在 "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` 这里调用平台补丁。"
#: ../../developer_guide/feature_guide/patch.md:35
msgid ""
"For offline mode, vLLM process calls the platform patch here "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when "
"parsing the input parameters."
msgstr ""
"对于离线模式vLLM 进程在解析输入参数时,会在此处调用平台补丁 "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config`。"
#: ../../developer_guide/feature_guide/patch.md:36
msgid ""
"**worker**: The patch code in this directory is for patching the code in "
"vLLM worker process. It's called by "
"`vllm_ascend/worker/worker::NPUWorker::__init__` when the vLLM worker "
"process is initialized."
msgstr ""
"**worker**:此目录中的补丁代码用于修补 vLLM worker 进程中的代码。在初始化 vLLM worker 进程时,会被 "
"`vllm_ascend/worker/worker::NPUWorker::__init__` 调用。"
#: ../../developer_guide/feature_guide/patch.md:37
msgid ""
"For both online and offline mode, vLLM engine core process calls the worker "
"patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` "
"when initializing the worker process."
msgstr ""
"无论是在线还是离线模式vLLM 引擎核心进程在初始化 worker 进程时,都会在这里调用 worker "
"补丁:`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker`。"
#: ../../developer_guide/feature_guide/patch.md:39
msgid ""
"In both **platform** and **worker** folder, there are several patch modules."
" They are used for patching different version of vLLM."
msgstr "在 **platform** 和 **worker** 文件夹中都有一些补丁模块。它们用于修补不同版本的 vLLM。"
#: ../../developer_guide/feature_guide/patch.md:41
msgid ""
"`patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version is "
"always the nearest version of vLLM. Once vLLM is released, we will drop this"
" patch module and bump to a new version. For example, `patch_0_9_2` is used "
"for patching vLLM 0.9.2."
msgstr ""
"`patch_0_9_2`:此模块用于修补 vLLM 0.9.2。该版本始终对应于 vLLM 的最近版本。一旦 vLLM "
"发布新版本,我们将移除此补丁模块并升级到新版本。例如,`patch_0_9_2` 就是用于修补 vLLM 0.9.2 的。"
#: ../../developer_guide/feature_guide/patch.md:42
msgid ""
"`patch_main`: This module is used for patching the code in vLLM main branch."
msgstr "`patch_main`:该模块用于修补 vLLM 主分支代码。"
#: ../../developer_guide/feature_guide/patch.md:43
msgid ""
"`patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM "
"main branch."
msgstr "`patch_common`:此模块用于同时修补 vLLM 0.9.2 版本和 vLLM 主分支。"
#: ../../developer_guide/feature_guide/patch.md:45
msgid "How to write a patch"
msgstr "如何撰写补丁"
#: ../../developer_guide/feature_guide/patch.md:47
msgid ""
"Before writing a patch, following the principle above, we should patch the "
"least code. If it's necessary, we can patch the code in either **platform** "
"and **worker** folder. Here is an example to patch `distributed` module in "
"vLLM."
msgstr ""
"在编写补丁之前,遵循上述原则,我们应尽量修改最少的代码。如果有必要,我们可以修改 **platform** 和 **worker** "
"文件夹中的代码。下面是一个在 vLLM 中修改 `distributed` 模块的示例。"
#: ../../developer_guide/feature_guide/patch.md:49
msgid ""
"Decide which version of vLLM we should patch. For example, after analysis, "
"here we want to patch both 0.9.2 and main of vLLM."
msgstr "决定我们应该修补哪个版本的 vLLM。例如经过分析后这里我们想要同时修补 vLLM 的 0.9.2 版和主分支main。"
#: ../../developer_guide/feature_guide/patch.md:50
msgid ""
"Decide which process we should patch. For example, here `distributed` "
"belongs to the vLLM main process, so we should patch `platform`."
msgstr "决定我们应该修补哪个进程。例如,这里 `distributed` 属于 vLLM 主进程,所以我们应该修补 `platform`。"
#: ../../developer_guide/feature_guide/patch.md:51
#, python-brace-format
msgid ""
"Create the patch file in the right folder. The file should be named as "
"`patch_{module_name}.py`. The example here is "
"`vllm_ascend/patch/platform/patch_common/patch_distributed.py`."
msgstr ""
"在正确的文件夹中创建补丁文件。文件应命名为 `patch_{module_name}.py`。此处的示例是 "
"`vllm_ascend/patch/platform/patch_common/patch_distributed.py`。"
#: ../../developer_guide/feature_guide/patch.md:52
msgid "Write your patch code in the new file. Here is an example:"
msgstr "在新文件中编写你的补丁代码。以下是一个示例:"
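As a concrete illustration of the pattern (the symbol being patched here is only an example, not the actual content of `patch_distributed.py`):

```python
# vllm_ascend/patch/platform/patch_common/patch_distributed.py
# Hedged sketch of the monkey-patch pattern; the patched function
# is illustrative, not the real patch contents.
import vllm.distributed.parallel_state as parallel_state

_original_destroy = parallel_state.destroy_model_parallel


def ascend_destroy_model_parallel():
    """Run NPU-specific cleanup, then fall back to the original logic."""
    # ... NPU-specific teardown would go here ...
    _original_destroy()


# Rebind the vLLM symbol so later callers pick up the patched version.
parallel_state.destroy_model_parallel = ascend_destroy_model_parallel
```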
#: ../../developer_guide/feature_guide/patch.md:62
msgid ""
"Import the patch file in `__init__.py`. In this example, add `import "
"vllm_ascend.patch.platform.patch_common.patch_distributed` into "
"`vllm_ascend/patch/platform/patch_common/__init__.py`."
msgstr ""
"在 `__init__.py` 中导入补丁文件。在这个示例中,将 `import "
"vllm_ascend.patch.platform.patch_common.patch_distributed` 添加到 "
"`vllm_ascend/patch/platform/patch_common/__init__.py` 中。"
#: ../../developer_guide/feature_guide/patch.md:63
msgid ""
"Add the description of the patch in `vllm_ascend/patch/__init__.py`. The "
"description format is as follows:"
msgstr "在 `vllm_ascend/patch/__init__.py` 中添加补丁的描述。描述格式如下:"
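A hypothetical illustration of such a description entry (the authoritative format is the one already used in `vllm_ascend/patch/__init__.py`; the patched symbol and PR number below are placeholders):

```python
# Hypothetical description entry kept in vllm_ascend/patch/__init__.py:
#
# ** File: platform/patch_common/patch_distributed.py **
#    1. `vllm.distributed.parallel_state.destroy_model_parallel`
#       Why: NPU resources need extra cleanup on shutdown.
#       How: wrap the original function with NPU-specific teardown.
#       Related PR: https://github.com/vllm-project/vllm-ascend/pull/xxxx
#       Future Plan: remove this patch once the change lands in vLLM.
```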
#: ../../developer_guide/feature_guide/patch.md:77
msgid ""
"Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should "
"contain the Unit Test and E2E Test as well. You can find more details in "
"[test guide](../contribution/testing.md)"
msgstr ""
"添加单元测试和端到端E2E测试。在 vLLM Ascend 中新增的任何代码也应包含单元测试和端到端测试。更多详情请参见 "
"[测试指南](../contribution/testing.md)。"
#: ../../developer_guide/feature_guide/patch.md:80
msgid "Limitation"
msgstr "限制"
#: ../../developer_guide/feature_guide/patch.md:81
msgid ""
"In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore "
"process and Worker process. Now vLLM Ascend only support patch the code in "
"Main process and Worker process by default. If you want to patch the code "
"runs in EngineCore process, you should patch EngineCore process entirely "
"during setup, the entry code is here `vllm.v1.engine.core`. Please override "
"`EngineCoreProc` and `DPEngineCoreProc` entirely."
msgstr ""
"在 V1 引擎中vLLM 会启动三种类型的进程主进程、EngineCore 进程和 Worker 进程。现在 vLLM Ascend "
"默认只支持在主进程和 Worker 进程中打补丁代码。如果你想要在 EngineCore 进程中打补丁,你需要在设置阶段对 EngineCore "
"进程整体打补丁,入口代码在 `vllm.v1.engine.core`。请完全重写 `EngineCoreProc` 和 "
"`DPEngineCoreProc`。"
#: ../../developer_guide/feature_guide/patch.md:82
msgid ""
"If you are running an edited vLLM code, the version of the vLLM may be "
"changed automatically. For example, if you runs an edited vLLM based on "
"v0.9.n, the version of vLLM may be change to v0.9.nxxx, in this case, the "
"patch for v0.9.n in vLLM Ascend would not work as expect, because that vLLM "
"Ascend can't distinguish the version of vLLM you're using. In this case, you"
" can set the environment variable `VLLM_VERSION` to specify the version of "
"vLLM you're using, then the patch for v0.9.2 should work."
msgstr ""
"如果你运行的是经过编辑的 vLLM 代码vLLM 的版本可能会被自动更改。例如,如果你基于 v0.9.n 运行了编辑后的 vLLMvLLM "
"的版本可能会变为 v0.9.nxxx在这种情况下vLLM Ascend 的 v0.9.n 补丁将无法正常工作,因为 vLLM Ascend "
"无法区分你所使用的 vLLM 版本。这时,你可以设置环境变量 `VLLM_VERSION` 来指定你所使用的 vLLM 版本,这样对 v0.9.2 "
"的补丁就应该可以正常工作。"


@@ -6,187 +6,183 @@
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: EMAIL@ADDRESS\n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: 2025-11-21 10:31+0000\n"
"POT-Creation-Date: 2025-11-21 10:19+0800\n"
"PO-Revision-Date: 2025-11-21 10:31\n"
"Last-Translator: Codex <codex@example.com>\n"
"Language: zh_CN\n"
"Language-Team: Chinese (Simplified) <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:1
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "MSProbe Debugging Guide"
msgstr "MSProbe 调试指南"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:3
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"During inference or training runs we often encounter accuracy anomalies "
"such as outputs drifting away from the expectation, unstable numerical "
"behavior (NaN/Inf), or predictions that no longer match the labels. To "
"pinpoint the root cause we have to monitor and capture intermediate data "
"produced while the model executes—feature maps, weights, activations, and"
" layer outputs. By capturing key tensors at specific stages, logging I/O "
"pairs for the core layers, and retaining contextual metadata (prompts, "
"tensor dtypes, hardware configuration, etc.), we can systematically trace"
" where the accuracy degradation or numerical error started. This guide "
"describes the end-to-end workflow for diagnosing accuracy issues for AI "
"models (with a focus on vllm-ascend services): preparation, data capture,"
" and analysis & verification."
msgstr ""
"在推理或训练过程中,我们经常会遇到输出偏离预期、出现 NaN/Inf "
"等数值不稳定现象,或者模型预测与标签不一致等精度异常。要定位根因,就必须监控并采集模型执行过程中的中间数据——例如特征图、权重、激活值及各层输出。通过在关键阶段捕获核心张量、记录核心层的输入输出对,并保留提示词、张量"
" dtype、硬件配置等上下文元数据我们可以系统追踪精度退化或数值错误的源头。本指南聚焦 vllm-ascend 服务,介绍 AI "
"模型精度问题排查的完整流程:准备、数据采集以及分析与验证。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:5
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "0. Background Concepts"
msgstr "0. 前置概念"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:7
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`msprobe` supports three accuracy levels:"
msgstr "`msprobe` 支持三种精度级别:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:9
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"**L0**: dumps tensors at the module level and generates `construct.json` "
"so that visualization tools can rebuild the network structure. A model or"
" submodule handle must be passed in."
msgstr "**L0**:在`nn.Module`级别保存`tensor`,并生成 `construct.json` 以便可视化工具还原网络结构,需要传入模型或子模块句柄。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:10
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"**L1**: collects operator-level statistics only, which is suitable for "
"lightweight troubleshooting."
msgstr "**L1**:仅采集算子级统计信息,适合轻量排查。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:11
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"**mix**: captures both structural information and operator statistics, "
"which is useful when you need both graph reconstruction and numerical "
"comparisons."
msgstr "**mix**:同时获取结构信息与算子统计,适用于既要构图又要进行数值对比的场景。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:13
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "1. Prerequisites"
msgstr "1. 前提条件"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:15
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "1.1 Install `msprobe`"
msgstr "1.1 安装 `msprobe`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:17
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Install msprobe with pip:"
msgstr "使用 pip 安装 msprobe"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:23
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "1.2 Visualization dependencies (optional)"
msgstr "1.2 可视化依赖(可选)"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:25
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Install additional dependencies if you need to visualize the captured "
"data."
msgstr "如需对采集的数据进行可视化,请安装以下依赖。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:27
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Install `tb_graph_ascend`:"
msgstr "安装 `tb_graph_ascend`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:33
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "2. Collecting Data with `msprobe`"
msgstr "2. 使用 `msprobe` 采集数据"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:35
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"We generally follow a coarse-to-fine strategy when capturing data. First,"
" identify the token where the issue shows up, and then decide which range"
" needs to be sampled around that token. The typical workflow is described"
" below."
msgstr "采集通常遵循由粗到细的策略:先确定问题出现的 token再围绕该 token 决定采样范围,常规流程如下。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:37
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "2.1 Prepare the dump configuration file"
msgstr "2.1 准备 dump 配置文件"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:39
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Create a `config.json` that can be parsed by `PrecisionDebugger` and "
"place it in an accessible path. Common fields are:"
msgstr "创建可被 `PrecisionDebugger` 解析的 `config.json` 并放置在可访问路径,常见字段如下:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Field"
msgstr "字段"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Description"
msgstr "说明"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Required"
msgstr "必填"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`task`"
msgstr "`task`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Type of dump task. Common PyTorch values include `\"statistics\"` and "
"`\"tensor\"`. A statistics task collects tensor statistics (mean, "
"variance, max, min, etc.) while a tensor task captures arbitrary tensors."
msgstr ""
"dump 任务类型。PyTorch 常见取值包括 `\"statistics\"` 和 `\"tensor\"`statistics "
"任务采集张量统计量均值、方差、最大值、最小值等tensor 任务可采集任意张量。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Yes"
msgstr "是"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`dump_path`"
msgstr "`dump_path`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Directory where dump results are stored. When omitted, `msprobe` uses its"
" default path."
msgstr "dump 结果保存目录,未配置时使用 `msprobe` 默认路径。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "No"
msgstr "否"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`rank`"
msgstr "`rank`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Ranks to sample. An empty list collects every rank. For single-card "
"tasks, you must set this field to `[]`."
"Ranks to sample. An empty list collects every rank. For single-card tasks "
"you must set this field to `[]`."
msgstr "指定需要采集的设备 rank空列表表示全部 rank单卡任务必须配置为 `[]`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`step`"
msgstr "`step`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Token iteration(s) to sample. An empty list means every iteration."
msgstr "指定采集的 token 轮次,空列表表示全部迭代。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`level`"
msgstr "`level`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Dump level string (`\"L0\"`, `\"L1\"`, or `\"mix\"`). `L0` targets "
"`nn.Module`, `L1` targets `torch.api`, and `mix` collects both."
@@ -194,354 +190,372 @@ msgstr ""
"dump 级别字符串(`\"L0\"`、`\"L1\"`、`\"mix\"`L0 面向 `nn.Module`L1 面向 "
"`torch.api`mix 同时采集两者。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`async_dump`"
msgstr "`async_dump`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Whether to enable asynchronous dump (supported for PyTorch "
"`statistics`/`tensor` tasks). Defaults to `false`."
msgstr "是否启用异步 dumpPyTorch `statistics`/`tensor` 任务可用),默认 `false`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`scope`"
msgstr "`scope`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Module range to sample. An empty list collects every module."
msgstr "指定需要采集的模块范围,空列表表示全部模块。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`list`"
msgstr "`list`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Operator range to sample. An empty list collects every operator."
msgstr "指定需要采集的算子范围,空列表表示全部算子。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:52
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"To restrict the operators that are captured, configure the `list` block:"
msgstr "如需进一步限定算子范围,请配置 `list`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:54
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`scope` (list[str]): In PyTorch PyNative scenarios this field restricts "
"the dump range. Provide two module or API names that follow the tool's "
"naming convention to lock a range; only data between the two names will "
"be dumped. Examples:"
msgstr ""
"`scope`list[str]):在 PyTorch 动态图场景下用于限定 dump 区间。按照工具命名格式提供两个模块或 API 名称,只会"
" dump 这一区间内的数据。示例:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:62
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"The `level` setting determines what can be provided—modules when "
"`level=L0`, APIs when `level=L1`, and either modules or APIs when "
"`level=mix`."
msgstr "`level` 的取值决定可配置内容:`level=L0` 填模块名,`level=L1` 填 API 名,`level=mix` 则二者皆可。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:64
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`list` (list[str]): Custom operator list. Options include:"
msgstr "`list`list[str]):用于自定义采集的算子范围,常见方式包括:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:65
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Supply the full names of specific APIs in PyTorch pynative scenarios to "
"only dump those APIs. Example: `\"list\": [\"Tensor.permute.1.forward\", "
"\"Tensor.transpose.2.forward\", \"Torch.relu.3.forward\"]`."
"Supply the full names of specific APIs in PyTorch pynative scenarios to only"
" dump those APIs. Example: `\"list\": [\"Tensor.permute.1.forward\", "
"\"Tensor.transpose.2.forward\", \"Torch.relu.3.backward\"]`."
msgstr ""
"在 PyTorch 动态图场景中配置 API 全称,仅 dump 这些 API例如 `\"list\": "
"[\"Tensor.permute.1.forward\", \"Tensor.transpose.2.forward\", "
"\"Torch.relu.3.forward\"]`。"
"\"Torch.relu.3.backward\"]`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:66
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"When `level=mix`, you can provide module names so that the dump expands "
"to everything produced while the module is running. Example: `\"list\": "
"When `level=mix`, you can provide module names so that the dump expands to "
"everything produced while the module is running. Example: `\"list\": "
"[\"Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0\"]`."
msgstr ""
"当 `level=mix` 时可以填写模块名称,工具会在该模块执行期间展开并 dump 所有数据,例如 `\"list\": "
"[\"Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0\"]`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:67
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Provide a substring such as `\"list\": [\"relu\"]` to dump every API "
"whose name contains the substring. When `level=mix`, modules whose names "
"contain the substring are also expanded."
"Provide a substring such as `\"list\": [\"relu\"]` to dump every API whose "
"name contains the substring. When `level=mix`, modules whose names contain "
"the substring are also expanded."
msgstr ""
"也可以仅提供子串(如 `\"list\": [\"relu\"]`),会 dump 名称包含该字符串的 API且 `level=mix` "
"时会展开名称包含该字符串的模块。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:69
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Example configuration:"
msgstr "示例配置:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:90
msgid "3. Enable `msprobe` in vllm-ascend"
msgstr "3. 在 vllm-ascend 中启用 `msprobe`"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "2. Enable `msprobe` in vllm-ascend"
msgstr "2. 在 vllm-ascend 中启用 `msprobe`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:92
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Start vLLM in eager mode by adding `--enforce-eager` (static-graph "
"scenarios are not supported yet) and pass the config path through "
"`--additional-config`:"
"Start vLLM in eager mode by adding `--enforce-eager` (static-graph scenarios"
" are not supported yet) and pass the config path through `--additional-"
"config`:"
msgstr ""
"通过添加 `--enforce-eager` 以 eager 模式启动 vLLM静态图暂不支持并通过 `--additional-"
"config` 传入配置路径:"
"通过添加 `--enforce-eager` 以 eager 模式启动 vLLM静态图暂不支持并通过 `--additional-config` "
"传入配置路径:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:103
msgid "4. Send requests and collect dumps"
msgstr "4. 发送请求并采集 dump"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "3. Send requests and collect dumps"
msgstr "3. 发送请求并采集 dump"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:105
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Send inference requests as usual, for example:"
msgstr "按常规方式发送推理请求,例如:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:118
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Each request drives the sequence `msprobe: start -> forward -> stop -> "
"step`. The runner invokes `step()` on every code path, so you always get "
"a complete dataset even if inference returns early."
"Each request drives the sequence `msprobe: start -> forward/backward -> stop"
" -> step`. The runner invokes `step()` on every code path, so you always get"
" a complete dataset even if inference returns early."
msgstr ""
"每个请求都会执行 `msprobe: start -> forward -> stop -> step`Runner "
"每个请求都会执行 `msprobe: start -> forward/backward -> stop -> step`Runner "
"在所有路径都会调用 `step()`,即使推理提前结束也能拿到完整数据。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:120
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Dump files are written into `dump_path`. They usually contain:"
msgstr "dump 文件写入 `dump_path`,通常包含:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:121
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Tensor files grouped by operator/module."
msgstr "按算子或模块划分的张量文件。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:122
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`dump.json`, which records metadata such as dtype, shape, min/max, and "
"`requires_grad`."
msgstr "描述 dtype、shape、最小/最大值以及 `requires_grad` 等信息的 `dump.json`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:123
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`construct.json`, which is generated when `level` is `L0` or `mix` "
"(required for visualization)."
"`construct.json`, which is generated when `level` is `L0` or `mix` (required"
" for visualization)."
msgstr "当级别为 `L0` 或 `mix` 时生成的 `construct.json`(可视化必需)。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:125
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Example directory layout:"
msgstr "目录结构示例:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:156
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
#, python-brace-format
msgid ""
"`rank`: Device ID. Each card writes its data to the corresponding "
"`rank{ID}` directory. In non-distributed scenarios the directory is "
"simply named `rank`."
"`rank`: Device ID. Each card writes its data to the corresponding `rank{ID}`"
" directory. In non-distributed scenarios the directory is simply named "
"`rank`."
msgstr "`rank`:设备 ID。每张卡写入对应的 `rank{ID}` 目录,非分布式场景目录名称为 `rank`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:157
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`dump_tensor_data`: Tensor payloads that were collected."
msgstr "`dump_tensor_data`:采集到的张量数据。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:158
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`dump.json`: Statistics for the forward data of each API or module, "
"including names, dtype, shape, max, min, mean, L2 norm (square root of "
"the L2 variance), and CRC-32 when `summary_mode=\"md5\"`. See [dump.json "
"file description](#dumpjson-file-description) for details."
"`dump.json`: Statistics for the forward/backward data of each API or module,"
" including names, dtype, shape, max, min, mean, L2 norm (square root of the "
"L2 variance), and CRC-32 when `summary_mode=\"md5\"`. See [dump.json file "
"description](#dumpjson-file-description) for details."
msgstr ""
"`dump.json`:各 API 或模块前向数据统计信息,包名称、dtype、shape、最大值、最小值、平均值、L2 范数L2 方差的平方根),以及在 `summary_mode=\"md5\"` 时的 CRC-32 值。详见 [dump.json 文件说明](#dumpjson-file-description)。"
"`dump.json`保存各 API 或模块前/反向数据统计,包名称、dtype、shape、max、min、mean、L2 "
"norm平方根以及在 `summary_mode=\"md5\"` 下的 CRC-32。详见 [dump.json file "
"description](#dumpjson-file-description)。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:159
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`dump_error_info.log`: Present only when the dump tool encountered an "
"error and records the failure log."
msgstr "`dump_error_info.log`:仅在 dump 工具遇到错误时生成,记录失败日志。"
"`dump_error_info.log`: Present only when the dump tool encountered an error "
"and records the failure log."
msgstr "`dump_error_info.log`:仅在 dump 工具报错时生成,记录错误日志。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:160
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`stack.json`: Call stacks for APIs/modules."
msgstr "`stack.json`API/模块的调用栈信息。"
msgstr "`stack.json`API/Module 的调用栈信息。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:161
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`construct.json`: Hierarchical structure description. Empty when "
"`level=L1`."
msgstr "`construct.json`:分层结构描述,当 `level=L1` 时为空。"
"`construct.json`: Hierarchical structure description. Empty when `level=L1`."
msgstr "`construct.json`:分层结构描述,`level=L1` 时为空。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:163
msgid "5. Analyze the results"
msgstr "5. 分析结果"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "4. Analyze the results"
msgstr "4. 分析结果"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:165
msgid "5.1 Prerequisites"
msgstr "5.1 前置条件"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "4.1 Prerequisites"
msgstr "4.1 前置条件"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:167
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"You typically need two dump datasets: one from the \"problem side\" (the "
"run that exposes the accuracy or numerical error) and another from the "
"You typically need two dump datasets: one from the \"problem side\" (the run"
" that exposes the accuracy or numerical error) and another from the "
"\"benchmark side\" (a good baseline). These datasets do not have to be "
"identical—they can come from different branches, framework versions, or "
"even alternative implementations (operator substitutions, different "
"graph-optimization switches, etc.). As long as they use the same or "
"similar inputs, hardware topology, and sampling points (step/token), "
"`msprobe` can compare them and locate the divergent nodes. If you cannot "
"find a perfectly clean benchmark, start by capturing the problem-side "
"data, craft the smallest reproducible case by hand, and perform a self-"
"comparison. Below we assume the problem dump is `problem_dump` and the "
"benchmark dump is `bench_dump`."
"identical—they can come from different branches, framework versions, or even"
" alternative implementations (operator substitutions, different graph-"
"optimization switches, etc.). As long as they use the same or similar "
"inputs, hardware topology, and sampling points (step/token), `msprobe` can "
"compare them and locate the divergent nodes. If you cannot find a perfectly "
"clean benchmark, start by capturing the problem-side data, craft the "
"smallest reproducible case by hand, and perform a self-comparison. Below we "
"assume the problem dump is `problem_dump` and the benchmark dump is "
"`bench_dump`."
msgstr ""
"通常需要两份 dump 数据集一份来自“问题侧”暴露精度或数值错误的运行另一份来自“标杆侧”良好的基线。这些数据集不必完全相同——它们可以来自不同的分支、框架版本甚至是替代实现算子替换、不同的图优化开关等。只要它们使用相同或相似的输入、硬件拓扑和采样点step/token`msprobe` 就可以比较它们并定位差异节点。如果找不到完全干净的标杆,可以先捕获问题侧数据,手动构建最小的可复现案例,并进行自比较。下文假设问题侧 dump 为 `problem_dump`,标杆侧 dump 为 `bench_dump`。"
"通常需要准备两份 dump "
"数据一份来自出现精度或数值异常的“问题侧”另一份来自表现正常的“标杆侧”。两份数据无需完全一致可以来自不同分支、不同框架版本甚至不同实现算子替换、图优化开关差异等。只要输入、硬件拓扑和采样点step/token保持一致或相近msprobe"
" 就能对比并定位差异节点。若无法找到足够干净的标杆,可先采集问题侧数据,手动构造最小复现用例并进行自对比。下文默认问题侧目录为 "
"`problem_dump`,标杆侧为 `bench_dump`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:169
msgid "5.2 Visualization"
msgstr "5.2 可视化"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "4.2 Visualization"
msgstr "4.2 可视化"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:171
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Use `msprobe -f pytorch graph` to generate results that can be opened "
"inside `tb_graph_ascend`."
msgstr "使用 `msprobe -f pytorch graph` 生成结果,在 `tb_graph_ascend` 中打开。"
"Use `msprobe graph_visualize` to generate results that can be opened inside "
"`tb_graph_ascend`."
msgstr "使用 `msprobe graph_visualize` 生成结果,在 `tb_graph_ascend` 中查看。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:173
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Ensure the dump contains `construct.json` (i.e., `level = L0` or `level ="
" mix`)."
msgstr "确保 dump 包含 `construct.json`(即 `level = L0` 或 `level = mix`)。"
"Ensure the dump contains `construct.json` (i.e., `level = L0` or `level = "
"mix`)."
msgstr "确保 dump 包含 `construct.json`(即 `level=L0` 或 `level=mix`)。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:174
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Prepare a comparison file such as `compare.json`. Its format and "
"generation flow are described in section 3.1.3 of "
"`msprobe_visualization.md`. Example (minimal runnable snippet):"
msgstr "准备一个比较文件,例如 `compare.json`。其格式和生成流程在 `msprobe_visualization.md` 的 3.1.3 节中描述。示例(最小可运行片段):"
"Prepare a comparison file such as `compare.json`. Its format and generation "
"flow are described in section 3.1.3 of `msprobe_visualization.md`. Example "
"(minimal runnable snippet):"
msgstr ""
"准备 `compare.json` 等对比文件,其格式与生成方式见 `msprobe_visualization.md` 3.1.3 节。示例:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:184
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Replace the paths with your dump directories before invoking `msprobe -f "
"pytorch graph`. **If you only need to build a single graph**, omit "
"Replace the paths with your dump directories before invoking `msprobe "
"graph_visualize`. **If you only need to build a single graph**, omit "
"`bench_path` to visualize one dump. Multi-rank scenarios (single rank, "
"multi-rank, or multi-step multi-rank) are also supported. `npu_path` or "
"`bench_path` must contain folders named `rank+number`, and every rank "
"folder must contain a non-empty `construct.json` together with "
"`dump.json` and `stack.json`. If any `construct.json` is empty, verify "
"that the dump level includes `L0` or `mix`. When comparing graphs, both "
"`npu_path` and `bench_path` must contain the same set of rank folders so "
"they can be paired one-to-one."
"`bench_path` must contain folders named `rank+number`, and every rank folder"
" must contain a non-empty `construct.json` together with `dump.json` and "
"`stack.json`. If any `construct.json` is empty, verify that the dump level "
"includes `L0` or `mix`. When comparing graphs, both `npu_path` and "
"`bench_path` must contain the same set of rank folders so they can be paired"
" one-to-one."
msgstr ""
"在调用 `msprobe -f pytorch graph` 前,将路径替换为你的 dump 目录。**如果只需构建单图**,省略 `bench_path` 以可视化一个 dump。多 rank 场景(单 rank、多 rank 或多 step 多 rank也受支持。`npu_path` 或 `bench_path` 必须包含名为 `rank+数字` 的文件夹,并且每个 rank 文件夹必须包含一个非空的 `construct.json` 以及 `dump.json` 和 `stack.json`。如果任何 `construct.json` 为空,请验证 dump 级别是否包含 `L0` 或 `mix`。比较图时,`npu_path` 和 `bench_path` 必须包含相同的 rank 文件夹集合,以便它们可以一一配对。"
"在执行 `msprobe graph_visualize` 前,将路径替换为实际 dump 目录。**只需构建单图**省略 "
"`bench_path`。单 rank、多 rank 以及多 step 多 rank 场景均受支持:`npu_path` 或 `bench_path` "
"下必须只有名为 `rank+数字` 的文件夹,并且每个 rank 目录都包含非空的 `construct.json`、`dump.json` 与 "
"`stack.json`。若某个 `construct.json` 为空,请确认 dump 级别包含 L0 或 mix。做图比较时两侧的 rank "
"目录数量和名称必须一一对应。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:209
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Run:"
msgstr "行:"
msgstr "行:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:217
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"After the comparison finishes, a `*.vis.db` file is created under "
"`graph_output`."
msgstr "比完成后会在 `graph_output` 下创建一个 `*.vis.db` 文件。"
msgstr "比完成后会在 `graph_output` 下生成 `*.vis.db` 文件。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:219
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
#, python-brace-format
msgid "Graph build: `build_{timestamp}.vis.db`"
msgstr "图构建:`build_{timestamp}.vis.db`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:220
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
#, python-brace-format
msgid "Graph comparison: `compare_{timestamp}.vis.db`"
msgstr "图比`compare_{timestamp}.vis.db`"
msgstr "图比:`compare_{timestamp}.vis.db`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:222
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Launch `tensorboard` and load the output directory to inspect structural "
"differences, numerical comparisons, overflow detection results, cross-"
"device communication nodes, and filters/search. Pass the directory "
"containing the `.vis.db` files to `--logdir`:"
"differences, numerical comparisons, overflow detection results, cross-device"
" communication nodes, and filters/search. Pass the directory containing the "
"`.vis.db` files to `--logdir`:"
msgstr ""
"启动 `tensorboard` 并加载输出目录,以检查结构差异、数值比较、溢出检测结果、跨设备通信节点以及过滤器/搜索。将包含 `.vis.db` 文件的目录传递给 `--logdir`"
"启动 `tensorboard` 并加载输出目录,可查看结构差异、精度对比、溢出检测、跨卡通信节点以及多级目录搜索/筛选。将包含 `.vis.db` "
"的目录传给 `--logdir`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:228
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Inspect the visualization. The UI usually displays the overall model "
"structure with operators, parameters, and tensor I/O. Click any node to "
"expand its children."
msgstr "检查可视化界面。UI 通常显示包含算子、参数张量 I/O 的整体模型结构。点击任何节点展开其子节点。"
msgstr "可视化界面中可查看模型整体结构(算子、参数张量 I/O),点击节点展开其子结构。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:229
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"**Difference visualization**: Comparison results highlight divergent "
"nodes with different colors (the larger the difference, the redder the "
"node). Click a node to view its detailed information including tensor "
"inputs/outputs, parameters, and operator type. Analyze the data "
"difference and the surrounding connections to pinpoint the exact "
"divergence."
"**Difference visualization**: Comparison results highlight divergent nodes "
"with different colors (the larger the difference, the redder the node). "
"Click a node to view its detailed information including tensor "
"inputs/outputs, parameters, and operator type. Analyze the data difference "
"and the surrounding connections to pinpoint the exact divergence."
msgstr ""
"**差异可视化**比较结果用不同颜色突出显示差异节点(差异越大,节点越红)。点击节点可查看其详细信息,包括张量输入/输出、参数算子类型。分析数据差异和周围连接,以精确定位确切的差异点。"
"**差异可视化**对比结果会使用不同颜色突出显示差异节点(差异越大颜色越红)。点击节点可查看输入输出张量、参数以及算子类型,据此结合上下游关系定位具体差异点。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:230
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "**Helper features**:"
msgstr "**辅助功能**"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:231
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Switch rank/step: Quickly check difference nodes on different ranks and "
"steps."
msgstr "切换 rank/step快速查不同 rank 和 step 的差异节点。"
msgstr "切换 rank/step快速查不同 rank 和 step 的差异节点。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:232
msgid "Search/filter: Use the search box to filter nodes by operator name, etc."
msgstr "搜索/过滤:使用搜索框按算子名称等过滤节点。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:233
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Manual mapping: Automatic mapping cannot cover every case, so the tool "
"lets you manually map nodes between the problem and benchmark graphs "
"before generating comparison results."
msgstr "手动映射:自动映射无法覆盖所有情况,因此该工具允许你在生成比较结果之前,手动映射问题图和标杆图之间的节点。"
"Search/filter: Use the search box to filter nodes by operator name, etc."
msgstr "搜索/筛选:可根据算子名称等快速过滤节点。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:235
msgid "6. Troubleshooting"
msgstr "6. 故障排除"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Manual mapping: Automatic mapping cannot cover every case, so the tool lets "
"you manually map nodes between the problem and benchmark graphs before "
"generating comparison results."
msgstr "手动映射:当自动映射无法覆盖所有情况时,可手动匹配问题侧与标杆侧节点后再生成对比结果。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:237
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "5. Troubleshooting"
msgstr "5. 故障排查"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`RuntimeError: Please enforce eager mode`: Restart vLLM and add the "
"`--enforce-eager` flag."
msgstr "`RuntimeError: Please enforce eager mode`:重启 vLLM 并添加 `--enforce-eager` 标志。"
msgstr ""
"`RuntimeError: Please enforce eager mode`:重启 vLLM 并加上 `--enforce-eager` 参数。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:238
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"No dump files: Confirm that the JSON path is correct and every node has "
"write permission. In distributed scenarios set `keep_all_ranks` so that "
"every rank writes its own dump."
msgstr "没有 dump 文件:确认 JSON 路径正确且每个节点都有写权限。在分布式场景中,设置 `keep_all_ranks` 以便每个 rank 写入自己的 dump。"
msgstr ""
"缺少 dump 文件:检查 JSON 路径是否正确、各节点是否具有写权限;分布式场景可启用 `keep_all_ranks` 让每个 rank "
"单独写入。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:239
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"Dumps are too large: Start with a `statistics` task to locate abnormal "
"tensors, then narrow the scope with `scope`/`list`/`tensor_list`, "
"`filters`, `token_range`, etc."
msgstr "Dump 文件过大:从 `statistics` 任务开始,定位异常张量,然后使用 `scope`/`list`/`tensor_list`、`filters`、`token_range` 等缩小范围。"
"tensors, then narrow the scope with `scope`/`list`/`tensor_list`, `filters`,"
" `token_range`, etc."
msgstr ""
"dump 体积过大:建议先运行 `statistics` 任务定位异常张量,再通过 "
"`scope`/`list`/`tensor_list`、`filters`、`token_range` 等方式缩小范围。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:243
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "Appendix"
msgstr "附录"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:245
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "dump.json file description"
msgstr "dump.json 文件说明"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:247
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "L0 level"
msgstr "L0 级别"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:249
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"An L0 `dump.json` contains forward I/O for modules together with "
"parameters. Using PyTorch's `Conv2d` as an example, the network code "
"looks like:"
msgstr "L0 级别的 `dump.json` 包含模块的前向 I/O 以及参数。以 PyTorch 的 `Conv2d` 为例,网络代码如下:"
"An L0 `dump.json` contains forward/backward I/O for modules together with "
"parameters and parameter gradients. Using PyTorch's `Conv2d` as an example, "
"the network code looks like:"
msgstr ""
"L0 级别的 `dump.json` 包含模块的前/反向输入输出以及参数与参数梯度。以下以 PyTorch 的 `Conv2d` 为例,网络代码如下:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:251
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, "
"padding=2, bias=True)`"
@@ -549,19 +563,36 @@ msgstr ""
"`output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, "
"padding=2, bias=True)`"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:253
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "`dump.json` contains the following entries:"
msgstr "`dump.json` 包含以下条目:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:255
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`Module.conv2.Conv2d.forward.0`: Forward data of the module. `input_args`"
" represents positional inputs, `input_kwargs` represents keyword inputs, "
"`Module.conv2.Conv2d.forward.0`: Forward data of the module. `input_args` "
"represents positional inputs, `input_kwargs` represents keyword inputs, "
"`output` stores forward outputs, and `parameters` stores weights/biases."
msgstr ""
"`Module.conv2.Conv2d.forward.0`:模块的前向数据`input_args` 表示位置输入`input_kwargs` 表示关键字输入,`output` 存储前向输出,`parameters` 存储权重/偏置。"
"`Module.conv2.Conv2d.forward.0`:模块的前向数据`input_args` 为位置参数`input_kwargs` "
"为关键字参数,`output` 存放前向输出,`parameters` 存放权重和偏置。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:257
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`Module.conv2.Conv2d.parameters_grad`: Parameter gradients (weight and "
"bias)."
msgstr "`Module.conv2.Conv2d.parameters_grad`模块参数的梯度weight 与 bias。"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`Module.conv2.Conv2d.backward.0`: Backward data of the module. `input` "
"represents gradients that flow into the module (gradients of the forward "
"outputs) and `output` represents gradients that flow out (gradients of the "
"module inputs)."
msgstr ""
"`Module.conv2.Conv2d.backward.0`:模块的反向数据,`input` 表示流入模块的梯度(对应前向输出),`output` "
"表示流出的梯度(对应模块输入)。"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
#, python-brace-format
msgid ""
"**Note**: When the `model` parameter passed to the dump API is "
@@ -569,32 +600,47 @@ msgid ""
"include the index inside the list (`{Module}.{index}.*`). Example: "
"`Module.0.conv1.Conv2d.forward.0`."
msgstr ""
"**注意**:当传递给 dump API 的 `model` 参数 `List[torch.nn.Module]` 或 `Tuple[torch.nn.Module]` 时,模块级名称包含列表内的索引(`{Module}.{index}.*`)。例如:`Module.0.conv1.Conv2d.forward.0`。"
"**说明**:当 dump API 的 `model` 参数 `List[torch.nn.Module]` 或 "
"`Tuple[torch.nn.Module]` 时,模块级名称会包含其在列表中的索引(`{Module}.{index}.*`),例如 "
"`Module.0.conv1.Conv2d.forward.0`。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:341
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "L1 level"
msgstr "L1 级别"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:343
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"An L1 `dump.json` records forward I/O for APIs. Using PyTorch's `relu` "
"function as an example (`output = torch.nn.functional.relu(input)`), the "
"file contains:"
msgstr "L1 级别的 `dump.json` 记录 API 的前向 I/O。以 PyTorch 的 `relu` 函数为例(`output = torch.nn.functional.relu(input)`),该文件包含:"
"An L1 `dump.json` records forward/backward I/O for APIs. Using PyTorch's "
"`relu` function as an example (`output = torch.nn.functional.relu(input)`), "
"the file contains:"
msgstr ""
"L1 级别的 `dump.json` 记录 API 的前/反向输入输出。以下以 PyTorch 的 `relu` 函数(`output = "
"torch.nn.functional.relu(input)`)为例:"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:345
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`Functional.relu.0.forward`: Forward data of the API. `input_args` are "
"positional inputs, `input_kwargs` are keyword inputs, and `output` stores"
" the forward outputs."
msgstr "`Functional.relu.0.forward`API 的前向数据。`input_args` 是位置输入,`input_kwargs` 是关键字输入,`output` 存储前向输出。"
"positional inputs, `input_kwargs` are keyword inputs, and `output` stores "
"the forward outputs."
msgstr ""
"`Functional.relu.0.forward`API 的前向数据,`input_args` 为位置输入,`input_kwargs` "
"为关键字输入,`output` 存放前向输出。"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:398
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"`Functional.relu.0.backward`: Backward data of the API. `input` represents "
"the gradients of the forward outputs, and `output` represents the gradients "
"that flow back to the forward inputs."
msgstr ""
"`Functional.relu.0.backward`API 的反向数据,`input` 表示前向输出的梯度,`output` "
"表示回传到前向输入的梯度。"
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid "mix level"
msgstr "mix 级别"
#: ../../source/developer_guide/performance_and_debug/msprobe_guide.md:400
#: ../../developer_guide/performance_and_debug/msprobe_guide.md
msgid ""
"A `mix` dump.json contains both L0 and L1 level data; the file format is "
"the same as the examples above."
msgstr "`mix` 级别的 dump.json 包含 L0 L1 级别的数据文件格式与上述示例相同。"
"A `mix` dump.json contains both L0 and L1 level data; the file format is the"
" same as the examples above."
msgstr "`mix` 级别的 dump.json 同时包含 L0 L1 数据文件格式与上述示例相同。"

View File

@@ -1,348 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:1
msgid "Optimization and Tuning"
msgstr "优化与调优"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:3
msgid ""
"This guide aims to help users improve vLLM-Ascend performance at the "
"system level. It includes OS configuration, library optimization, "
"deployment guide, and so on. Any feedback is welcome."
msgstr "本指南旨在帮助用户在系统层面提升 vLLM-Ascend 的性能。内容包括操作系统配置、库优化、部署指南等。欢迎提供任何反馈。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:5
msgid "Preparation"
msgstr "准备工作"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:7
msgid "Run the container:"
msgstr "运行容器:"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:31
msgid "Configure your environment:"
msgstr "配置您的环境:"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:49
msgid "Install vllm and vllm-ascend:"
msgstr "安装 vllm 和 vllm-ascend"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:61
msgid ""
"Please follow the [Installation "
"Guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) "
"to make sure vLLM and vllm-ascend are installed correctly."
msgstr "请遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)以确保 vLLM 和 vllm-ascend 正确安装。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:64
msgid ""
"Make sure your vLLM and vllm-ascend are installed after your Python "
"configuration is completed, because these packages will build binary "
"files using python in current environment. If you install vLLM and vllm-"
"ascend before completing section 1.1, the binary files will not use the "
"optimized python."
msgstr "请确保在完成 Python 配置后再安装 vLLM 和 vllm-ascend因为这些软件包将使用当前环境中的 python 构建二进制文件。如果您在完成第 1.1 节之前就安装了 vLLM 和 vllm-ascend则二进制文件将不会使用优化后的 python。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:67
msgid "Optimizations"
msgstr "优化措施"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:69
msgid "1. Compilation Optimization"
msgstr "1. 编译优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:71
msgid "1.1. Install optimized `python`"
msgstr "1.1. 安装优化版 `python`"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:73
msgid ""
"Python supports **LTO** and **PGO** optimization starting from version "
"`3.6` and above, which can be enabled at compile time. And we have "
"offered optimized `python` packages directly to users for the sake of "
"convenience. You can also reproduce the `python` build following this "
"[tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html)"
" according to your specific scenarios."
msgstr "Python 从 `3.6` 及以上版本开始支持 **LTO** 和 **PGO** 优化,可以在编译时启用。为了方便用户,我们直接提供了优化版的 `python` 软件包。您也可以根据具体场景,按照此[教程](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html)自行构建 `python`。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:101
msgid "2. OS Optimization"
msgstr "2. 操作系统优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:103
msgid "2.1. jemalloc"
msgstr "2.1. jemalloc"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:105
msgid ""
"**jemalloc** is a memory allocator that improves performance for multi-"
"threaded scenarios and can reduce memory fragmentation. jemalloc uses a "
"local thread memory manager to allocate variables, which can avoid lock "
"competition between threads and can hugely optimize performance."
msgstr "**jemalloc** 是一个内存分配器可提升多线程场景下的性能并减少内存碎片。jemalloc 使用本地线程内存管理器来分配变量,这可以避免线程间的锁竞争,从而大幅优化性能。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:117
msgid "2.2. Tcmalloc"
msgstr "2.2. Tcmalloc"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:119
msgid ""
"**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that"
" improves overall performance while ensuring low latency by introducing a"
" multi-level cache structure, reducing mutex contention and optimizing "
"large object processing flow. Find more "
"[details](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html)."
msgstr "**TCMalloc (Thread Caching Malloc)** 是一个通用内存分配器,通过引入多级缓存结构、减少互斥锁竞争以及优化大对象处理流程,在确保低延迟的同时提升整体性能。更多[详情](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html)。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:140
msgid "3. `torch_npu` Optimization"
msgstr "3. `torch_npu` 优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:142
msgid ""
"Some performance tuning features in `torch_npu` are controlled by "
"environment variables. Some features and their related environment "
"variables are shown below."
msgstr "`torch_npu` 中的一些性能调优功能由环境变量控制。部分功能及其相关环境变量如下所示。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:144
msgid "Memory optimization:"
msgstr "内存优化:"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:155
msgid "Scheduling optimization:"
msgstr "调度优化:"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:166
msgid "4. CANN Optimization"
msgstr "4. CANN 优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:168
msgid "4.1. HCCL Optimization"
msgstr "4.1. HCCL 优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:170
msgid ""
"There are some performance tuning features in HCCL, which are controlled "
"by environment variables."
msgstr "HCCL 中有一些性能调优功能,由环境变量控制。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:172
msgid ""
"You can configure HCCL to use \"AIV\" mode to optimize performance by "
"setting the environment variable shown below. In \"AIV\" mode, the "
"communication is scheduled by AI vector core directly with RoCE, instead "
"of being scheduled by AI CPU."
msgstr "您可以通过设置如下所示的环境变量,将 HCCL 配置为使用 \"AIV\" 模式以优化性能。在 \"AIV\" 模式下,通信由 AI 向量核通过 RoCE 直接调度,而非由 AI CPU 调度。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:179
msgid ""
"Plus, there are more features for performance optimization in specific "
"scenarios, which are shown below."
msgstr "此外,针对特定场景还有更多性能优化功能,如下所示。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:181
msgid ""
"`HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two "
"8Ps as the mesh interconnect link. Find more "
"[details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html)."
msgstr "`HCCL_INTRA_ROCE_ENABLE`:在两个 8P 之间使用 RDMA 链路而非 SDMA 链路作为网状互连链路。更多[详情](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html)。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:182
msgid ""
"`HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find"
" more "
"[details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html)."
msgstr "`HCCL_RDMA_TC`:使用此变量配置 RDMA 网卡的流量类别。更多[详情](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html)。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:183
msgid ""
"`HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find"
" more "
"[details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html)."
msgstr "`HCCL_RDMA_SL`:使用此变量配置 RDMA 网卡的服务级别。更多[详情](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html)。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:184
msgid ""
"`HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data "
"between two NPUs. Find more "
"[details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html)."
msgstr "`HCCL_BUFFSIZE`:使用此变量控制两个 NPU 之间共享数据的缓存大小。更多[详情](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html)。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:186
msgid "5. OS Optimization"
msgstr "5. 操作系统优化"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:188
msgid ""
"This section describes operating systemlevel optimizations applied on "
"the host machine (bare metal or Kubernetes node) to improve performance "
"stability, latency, and throughput for inference workloads."
msgstr "本节描述了在主机(裸机或 Kubernetes 节点)上应用的操作系统级优化,旨在提升推理工作负载的性能稳定性、延迟和吞吐量。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:191
msgid ""
"These settings must be applied on the host OS and with root privileges. "
"Not inside containers."
msgstr "这些设置必须在主机操作系统上以 root 权限应用,而不是在容器内部。"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:194
msgid "5.1"
msgstr "5.1"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:196
msgid "Set CPU Frequency Governor to `performance`"
msgstr "将 CPU 频率调节器设置为 `performance`"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:202
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:219
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:239
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:261
msgid "Purpose"
msgstr "目的"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:204
msgid "Forces all CPU cores to run under the `performance` governor"
msgstr "强制所有 CPU 核心在 `performance` 调节器下运行"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:205
msgid "Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)"
msgstr "禁用动态频率调节(例如 `ondemand`、`powersave`"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:207
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:223
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:243
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:265
msgid "Benefits"
msgstr "优势"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:209
msgid "Keeps CPU cores at maximum frequency"
msgstr "使 CPU 核心保持最高频率"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:210
msgid "Reduces latency jitter"
msgstr "减少延迟抖动"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:211
msgid "Improves predictability for inference workloads"
msgstr "提高推理工作负载的可预测性"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:213
msgid "5.2 Disable Swap Usage"
msgstr "5.2 禁用交换空间使用"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:221
msgid "Minimizes the kernels tendency to swap memory pages to disk"
msgstr "最小化内核将内存页交换到磁盘的倾向"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:225
msgid "Prevents severe latency spikes caused by swapping"
msgstr "防止因交换导致的严重延迟峰值"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:226
msgid "Improves stability for large in-memory models"
msgstr "提高大型内存模型的稳定性"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:228
msgid "Notes"
msgstr "备注"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:230
msgid "For inference workloads, swap can introduce second-level latency"
msgstr "对于推理工作负载,交换可能导致秒级延迟"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:231
msgid "Recommended values are `0` or `1`"
msgstr "推荐值为 `0` 或 `1`"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:233
msgid "5.3 Disable Automatic NUMA Balancing"
msgstr "5.3 禁用自动 NUMA 平衡"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:241
msgid "Disables the kernels automatic NUMA page migration mechanism"
msgstr "禁用内核的自动 NUMA 页面迁移机制"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:245
msgid "Prevents background memory page migrations"
msgstr "防止后台内存页迁移"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:246
msgid "Reduces unpredictable memory access latency"
msgstr "减少不可预测的内存访问延迟"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:247
msgid "Improves performance stability on NUMA systems"
msgstr "提高 NUMA 系统上的性能稳定性"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:249
msgid "Recommended For"
msgstr "推荐用于"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:251
msgid "Multi-socket servers"
msgstr "多插槽服务器"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:252
msgid "Ascend / NPU deployments with explicit NUMA binding"
msgstr "具有显式 NUMA 绑定的 Ascend / NPU 部署"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:253
msgid "Systems with manually managed CPU and memory affinity"
msgstr "手动管理 CPU 和内存亲和性的系统"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:255
msgid "5.4 Increase Scheduler Migration Cost"
msgstr "5.4 增加调度器迁移成本"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:263
msgid "Increases the cost for the scheduler to migrate tasks between CPU cores"
msgstr "增加调度器在 CPU 核心间迁移任务的成本"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:267
msgid "Reduces frequent thread migration"
msgstr "减少频繁的线程迁移"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:268
msgid "Improves CPU cache locality"
msgstr "提高 CPU 缓存局部性"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:269
msgid "Lowers latency jitter for inference workloads"
msgstr "降低推理工作负载的延迟抖动"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:271
msgid "Parameter Details"
msgstr "参数详情"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:273
msgid "Unit: nanoseconds (ns)"
msgstr "单位:纳秒 (ns)"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:274
msgid "Typical recommended range: 50000100000"
msgstr "典型推荐范围50000100000"
#: ../../source/developer_guide/performance_and_debug/optimization_and_tuning.md:275
msgid "Higher values encourage threads to stay on the same CPU core"
msgstr "更高的值鼓励线程保持在同一个 CPU 核心上"

View File

@@ -4,338 +4,85 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:1
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:1
msgid "Performance Benchmark"
msgstr "性能基准测试"
msgstr "性能基准"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:3
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:2
msgid ""
"This document details the benchmark methodology for vllm-ascend, aimed at"
" evaluating the performance under a variety of workloads. To maintain "
"This document details the benchmark methodology for vllm-ascend, aimed at "
"evaluating the performance under a variety of workloads. To maintain "
"alignment with vLLM, we use the [benchmark](https://github.com/vllm-"
"project/vllm/tree/main/benchmarks) script provided by the vllm project."
msgstr ""
"本文档详细说明了 vllm-ascend 的基准测试方法,旨在评估其在多种工作负载下的性能。为了与 vLLM 保持一致,我们使用 vllm "
"项目提供的 [benchmark](https://github.com/vllm-"
"project/vllm/tree/main/benchmarks) 脚本。"
"本文档详细说明了 vllm-ascend 的基准测试方法,旨在评估其在多种工作负载下的性能。为了与 vLLM 保持一致,我们使用 vllm 项目提供的 "
"[benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) 脚本。"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:5
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:4
msgid ""
"**Benchmark Coverage**: We measure offline E2E latency and throughput, "
"and fixed-QPS online serving benchmarks. For more details, see [vllm-"
"ascend benchmark scripts](https://github.com/vllm-project/vllm-"
"**Benchmark Coverage**: We measure offline e2e latency and throughput, and "
"fixed-QPS online serving benchmarks, for more details see [vllm-ascend "
"benchmark scripts](https://github.com/vllm-project/vllm-"
"ascend/tree/main/benchmarks)."
msgstr ""
"**基准测试覆盖范围**:我们测量离线端到端延迟和吞吐量,以及固定 QPS 的在线服务基准测试。更多详情请参见 [vllm-ascend "
"基准测试脚本](https://github.com/vllm-project/vllm-"
"ascend/tree/main/benchmarks)。"
"基准测试脚本](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)。"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:7
msgid "**Legend Description**:"
msgstr "**图例说明**"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:9
msgid "✅ = Supported"
msgstr "✅ = 已支持"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:10
msgid "🟡 = Partial / Work in progress"
msgstr "🟡 = 部分支持 / 开发中"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:11
msgid "🚧 = Under development"
msgstr "🚧 = 开发中"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:13
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:6
msgid "1. Run docker container"
msgstr "1. 运行 Docker 容器"
msgstr "1. 运行 docker 容器"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:39
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:31
msgid "2. Install dependencies"
msgstr "2. 安装依赖项"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:47
msgid "3. Run basic benchmarks"
msgstr "3. 运行基础基准测试"
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:38
msgid "3. (Optional)Prepare model weights"
msgstr "3.(可选)准备模型权重"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:49
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:39
msgid ""
"This section introduces how to perform performance testing using the "
"benchmark suite built into VLLM."
msgstr "本节介绍如何使用 VLLM 内置的基准测试套件进行性能测试。"
"For faster running speed, we recommend downloading the model in advance"
msgstr "为了更快的运行速度,建议提前下载模型:"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:51
msgid "3.1 Dataset"
msgstr "3.1 数据集"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:53
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:44
msgid ""
"VLLM supports a variety of [datasets](https://github.com/vllm-"
"project/vllm/blob/main/vllm/benchmarks/datasets.py)."
msgstr "VLLM 支持多种[数据集](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py)。"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Dataset"
msgstr "数据集"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Online"
msgstr "在线"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Offline"
msgstr "离线"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Data Path"
msgstr "数据路径"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "ShareGPT"
msgstr "ShareGPT"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "✅"
msgstr "✅"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid ""
"`wget "
"https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`"
"You can also replace all model paths in the [json](https://github.com/vllm-"
"project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:"
msgstr ""
"`wget "
"https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`"
"你也可以将 [json](https://github.com/vllm-project/vllm-"
"ascend/tree/main/benchmarks/tests) 文件中的所有模型路径替换为你的本地路径:"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "ShareGPT4V (Image)"
msgstr "ShareGPT4V (图像)"
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:60
msgid "4. Run benchmark script"
msgstr "4. 运行基准测试脚本"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:61
msgid "Run benchmark script:"
msgstr "运行基准测试脚本:"
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:66
msgid "After about 10 mins, the output is as shown below:"
msgstr "大约 10 分钟后,输出如下所示:"
#: ../../developer_guide/performance_and_debug/performance_benchmark.md:176
msgid ""
"`wget https://huggingface.co/datasets/Lin-"
"Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note"
" that the images need to be downloaded separately. For example, to "
"download COCO's 2017 Train images:<br>`wget "
"http://images.cocodataset.org/zips/train2017.zip`"
msgstr ""
"`wget https://huggingface.co/datasets/Lin-"
"Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>请注意,图像需要单独下载。例如,要下载"
" COCO 2017 训练集图像:<br>`wget "
"http://images.cocodataset.org/zips/train2017.zip`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "ShareGPT4Video (Video)"
msgstr "ShareGPT4Video (视频)"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video`"
msgstr "`git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "BurstGPT"
msgstr "BurstGPT"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid ""
"`wget "
"https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv`"
msgstr ""
"`wget "
"https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Sonnet (deprecated)"
msgstr "Sonnet (已弃用)"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Local file: `benchmarks/sonnet.txt`"
msgstr "本地文件:`benchmarks/sonnet.txt`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Random"
msgstr "随机"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`synthetic`"
msgstr "`synthetic`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "RandomMultiModal (Image/Video)"
msgstr "RandomMultiModal (图像/视频)"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "🟡"
msgstr "🟡"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "🚧"
msgstr "🚧"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "RandomForReranking"
msgstr "RandomForReranking"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Prefix Repetition"
msgstr "前缀重复"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-VisionArena"
msgstr "HuggingFace-VisionArena"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`lmarena-ai/VisionArena-Chat`"
msgstr "`lmarena-ai/VisionArena-Chat`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-MMVU"
msgstr "HuggingFace-MMVU"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`yale-nlp/MMVU`"
msgstr "`yale-nlp/MMVU`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-InstructCoder"
msgstr "HuggingFace-InstructCoder"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`likaixin/InstructCoder`"
msgstr "`likaixin/InstructCoder`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-AIMO"
msgstr "HuggingFace-AIMO"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid ""
"`AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-"
"CoT`"
msgstr "`AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-Other"
msgstr "HuggingFace-其他"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered`"
msgstr "`lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-MTBench"
msgstr "HuggingFace-MTBench"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`philschmid/mt-bench`"
msgstr "`philschmid/mt-bench`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "HuggingFace-Blazedit"
msgstr "HuggingFace-Blazedit"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "`vdaita/edit_5k_char`, `vdaita/edit_10k_char`"
msgstr "`vdaita/edit_5k_char`, `vdaita/edit_10k_char`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Spec Bench"
msgstr "Spec Bench"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid ""
"`wget https://raw.githubusercontent.com/hemingkx/Spec-"
"Bench/refs/heads/main/data/spec_bench/question.jsonl`"
msgstr "`wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Custom"
msgstr "自定义"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:15
msgid "Local file: `data.jsonl`"
msgstr "本地文件:`data.jsonl`"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:83
msgid ""
"The datasets mentioned above are all links to datasets on huggingface. "
"The dataset's `dataset-name` should be set to `hf`. For local `dataset-"
"path`, please set `hf-name` to its Hugging Face ID like"
msgstr "上述提到的数据集均为 Hugging Face 上数据集的链接。数据集的 `dataset-name` 应设置为 `hf`。对于本地的 `dataset-path`,请将 `hf-name` 设置为其 Hugging Face ID例如"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:93
msgid "3.2 Run basic benchmark"
msgstr "3.2 运行基础基准测试"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:95
msgid "3.2.1 Online serving"
msgstr "3.2.1 在线服务"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:97
msgid "First start serving your model:"
msgstr "首先启动模型服务:"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:103
msgid "Then run the benchmarking script:"
msgstr "然后运行基准测试脚本:"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:118
msgid "If successful, you will see the following output:"
msgstr "如果成功,您将看到以下输出:"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:147
msgid "3.2.2 Offline Throughput Benchmark"
msgstr "3.2.2 离线吞吐量基准测试"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:158
msgid "If successful, you will see the following output"
msgstr "如果成功,您将看到以下输出"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:167
msgid "3.2.4 Multi-Modal Benchmark"
msgstr "3.2.4 多模态基准测试"
#: ../../source/developer_guide/performance_and_debug/performance_benchmark.md:216
msgid "3.2.5 Embedding Benchmark"
msgstr "3.2.5 嵌入基准测试"
#~ msgid "3. (Optional)Prepare model weights"
#~ msgstr "3.(可选)准备模型权重"
#~ msgid ""
#~ "For faster running speed, we recommend"
#~ " downloading the model in advance"
#~ msgstr "为了获得更快的运行速度,我们建议提前下载模型:"
#~ msgid ""
#~ "You can also replace all model "
#~ "paths in the [json](https://github.com/vllm-"
#~ "project/vllm-ascend/tree/main/benchmarks/tests) files "
#~ "with your local paths:"
#~ msgstr ""
#~ "您也可以将 [json](https://github.com/vllm-project/vllm-"
#~ "ascend/tree/main/benchmarks/tests) 文件中的所有模型路径替换为您的本地路径:"
#~ msgid "After about 10 mins, the output is as shown below:"
#~ msgstr "大约 10 分钟后,输出如下所示:"
#~ msgid ""
#~ "The result json files are generated "
#~ "into the path `benchmark/results` These "
#~ "files contain detailed benchmarking results"
#~ " for further analysis."
#~ msgstr "结果 JSON 文件将生成到路径 `benchmark/results`。这些文件包含详细的基准测试结果,可用于进一步分析。"
"The result json files are generated into the path `benchmark/results` These "
"files contain detailed benchmarking results for further analysis."
msgstr "结果 json 文件会生成到路径 `benchmark/results`。这些文件包含了用于进一步分析的详细基准测试结果。"

File diff suppressed because it is too large

View File

@@ -4,71 +4,76 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: 2025-07-18 10:05+0800\n"
"Last-Translator: \n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
"X-Generator: Poedit 3.5\n"
#: ../../source/index.md:33
#: ../../index.md:33
msgid "Getting Started"
msgstr "快速开始"
#: ../../source/index.md:45
#: ../../index.md:43
msgid "User Guide"
msgstr "用户指南"
#: ../../source/index.md:56
#: ../../index.md:53
msgid "Developer Guide"
msgstr "开发者指南"
#: ../../source/index.md:66
#: ../../index.md:64
msgid "Community"
msgstr "社区"
#: ../../source/index.md:1
#: ../../index.md:1
msgid "Welcome to vLLM Ascend Plugin"
msgstr "欢迎使用 vLLM Ascend 插件"
#: ../../source/index.md:3
#: ../../index.md:3
msgid "vLLM"
msgstr "vLLM"
#: ../../source/index.md:24
#: ../../index.md:24
msgid ""
"vLLM Ascend plugin (vllm-ascend) is a community-maintained hardware "
"plugin for running vLLM on the Ascend NPU."
msgstr "vLLM Ascend 插件vllm-ascend是一个由社区维护的硬件插件用于在昇腾 NPU 上运行 vLLM。"
#: ../../source/index.md:26
msgid ""
"This plugin is the recommended approach for supporting the Ascend backend"
" within the vLLM community. It adheres to the principles outlined in the "
"[[RFC]: Hardware pluggable](https://github.com/vllm-"
"project/vllm/issues/11162), providing a hardware-pluggable interface that"
" decouples the integration of the Ascend NPU with vLLM."
"vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin "
"for running vLLM on the Ascend NPU."
msgstr ""
"该插件是 vLLM 社区内支持 Ascend 后端的推荐方法。它遵循 [[RFC]: Hardware "
"pluggable](https://github.com/vllm-project/vllm/issues/11162) "
"中概述的原则,提供了一个硬件可插拔接口,将昇腾 NPU 与 vLLM 的集成解耦。"
"vLLM Ascend 插件vllm-ascend是一个由社区维护的硬件插件用于在 Ascend "
"NPU 上运行 vLLM。"
#: ../../source/index.md:28
#: ../../index.md:26
msgid ""
"This plugin is the recommended approach for supporting the Ascend backend "
"within the vLLM community. It adheres to the principles outlined in the "
"[[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/"
"issues/11162), providing a hardware-pluggable interface that decouples the "
"integration of the Ascend NPU with vLLM."
msgstr ""
"该插件是 vLLM 社区推荐用于支持 Ascend 后端的方法。它遵循 [[RFC]: Hardware "
"pluggable](https://github.com/vllm-project/vllm/issues/11162) 中提出的原"
"则,提供了一个硬件可插拔接口,实现了 Ascend NPU 与 vLLM 集成的解耦。"
#: ../../index.md:28
msgid ""
"By using vLLM Ascend plugin, popular open-source models, including "
"Transformer-like, Mixture-of-Experts, Embedding, Multi-modal LLMs can run"
" seamlessly on the Ascend NPU."
"Transformer-like, Mixture-of-Expert, Embedding, Multi-modal LLMs can run "
"seamlessly on the Ascend NPU."
msgstr ""
"通过使用 vLLM Ascend 插件,包括 Transformer、混合专家、嵌入和多模态大语言模型在内的流行开源模型,都可以在昇腾 NPU 上无缝运行。"
"通过使用 vLLM Ascend 插件,流行的开源模型,包括 Transformer、混合专家、"
"嵌入式、多模态大模型等,都可以在 Ascend NPU 上无缝运行。"
#: ../../source/index.md:30
#: ../../index.md:30
msgid "Documentation"
msgstr "文档"
msgstr "文档"

View File

@@ -4,517 +4,290 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: 2025-07-18 10:09+0800\n"
"Last-Translator: \n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
"X-Generator: Poedit 3.5\n"
#: ../../source/installation.md:1
#: ../../installation.md:1
msgid "Installation"
msgstr "安装"
#: ../../source/installation.md:3
#: ../../installation.md:3
msgid "This document describes how to install vllm-ascend manually."
msgstr "本文档介绍如何手动安装 vllm-ascend。"
#: ../../source/installation.md:5
#: ../../installation.md:5
msgid "Requirements"
msgstr "系统要求"
msgstr "要求"
#: ../../source/installation.md:7
#: ../../installation.md:7
msgid "OS: Linux"
msgstr "操作系统Linux"
#: ../../source/installation.md:8
msgid "Python: >= 3.10, < 3.12"
msgstr "Python>= 3.10< 3.12"
#: ../../installation.md:8
msgid "Python: >= 3.9, < 3.12"
msgstr "Python>= 3.9< 3.12"
#: ../../source/installation.md:9
msgid "Hardware with Ascend NPUs. It's usually the Atlas 800 A2 series."
msgstr "配备昇腾 NPU 的硬件,通常是 Atlas 800 A2 系列。"
#: ../../installation.md:9
msgid "A hardware with Ascend NPU. It's usually the Atlas 800 A2 series."
msgstr "配备昇腾NPU的硬件通常是Atlas 800 A2系列。"
#: ../../source/installation.md:10
#: ../../installation.md:10
msgid "Software:"
msgstr "软件:"
#: ../../source/installation.md
#: ../../installation.md
msgid "Software"
msgstr "软件"
#: ../../source/installation.md
#: ../../installation.md
msgid "Supported version"
msgstr "支持的版本"
#: ../../source/installation.md
#: ../../installation.md
msgid "Note"
msgstr "注"
msgstr "注"
#: ../../source/installation.md
msgid "Ascend HDK"
msgstr "昇腾 HDK"
#: ../../source/installation.md
msgid ""
"Refer to the documentation [CANN "
"8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html)"
msgstr ""
"请参考文档 [CANN "
"8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html)"
#: ../../source/installation.md
msgid "Required for CANN"
msgstr "CANN 所需"
#: ../../source/installation.md
#: ../../installation.md
msgid "CANN"
msgstr "CANN"
#: ../../source/installation.md
msgid "== 8.5.1"
msgstr "== 8.5.1"
#: ../../installation.md
msgid ">= 8.1.RC1"
msgstr ">= 8.1.RC1"
#: ../../source/installation.md
#: ../../installation.md
msgid "Required for vllm-ascend and torch-npu"
msgstr "vllm-ascend 和 torch-npu 需"
msgstr "vllm-ascend 和 torch-npu 需"
#: ../../source/installation.md
#: ../../installation.md
msgid "torch-npu"
msgstr "torch-npu"
#: ../../source/installation.md
msgid "== 2.9.0"
msgstr "== 2.9.0"
#: ../../installation.md
msgid ">= 2.5.1.post1.dev20250619"
msgstr ">= 2.5.1.post1.dev20250619"
#: ../../source/installation.md
#: ../../installation.md
msgid ""
"Required for vllm-ascend, No need to install manually, it will be auto "
"installed in below steps"
msgstr "vllm-ascend 需,无需手动安装,将在后续步骤自动安装"
msgstr "vllm-ascend 需,无需手动安装,后续步骤自动安装"
#: ../../source/installation.md
#: ../../installation.md
msgid "torch"
msgstr "torch"
#: ../../source/installation.md
#: ../../installation.md
msgid ">= 2.5.1"
msgstr ">= 2.5.1"
#: ../../installation.md
msgid "Required for torch-npu and vllm"
msgstr "torch-npu 和 vllm 所需"
#: ../../source/installation.md
msgid "NNAL"
msgstr "NNAL"
#: ../../installation.md:18
msgid "You have 2 way to install:"
msgstr "你有两种安装方式:"
#: ../../source/installation.md
msgid "Required for libatb.so, enables advanced tensor operations"
msgstr "libatb.so 所需,用于启用高级张量运算"
#: ../../source/installation.md:20
msgid "There are two installation methods:"
msgstr "有两种安装方法:"
#: ../../source/installation.md:22
#: ../../installation.md:19
msgid ""
"**Using pip**: first prepare the environment manually or via a CANN "
"image, then install `vllm-ascend` using pip."
msgstr "**使用 pip**:首先手动或通过 CANN 镜像准备环境,然后使用 pip 安装 `vllm-ascend`。"
"**Using pip**: first prepare env manually or via CANN image, then install "
"`vllm-ascend` using pip."
msgstr ""
"**使用 pip**:首先手动准备环境或通过 CANN 镜像准备环境,然后使用 pip 安装 "
"`vllm-ascend`。"
#: ../../source/installation.md:23
msgid "**Using docker**: use the `vllm-ascend` pre-built docker image directly."
#: ../../installation.md:20
msgid ""
"**Using docker**: use the `vllm-ascend` pre-built docker image directly."
msgstr "**使用 docker**:直接使用 `vllm-ascend` 预构建的 docker 镜像。"
#: ../../source/installation.md:25
msgid "Configure Ascend CANN environment"
msgstr "配置昇腾 CANN 环境"
#: ../../installation.md:22
msgid "Configure a new environment"
msgstr "配置一个新环境"
#: ../../source/installation.md:27
#: ../../installation.md:24
msgid ""
"Before installation, you need to make sure firmware/driver, and CANN are "
"installed correctly, refer to [Ascend Environment Setup "
"Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) "
"for more details."
"Before installing, you need to make sure firmware/driver and CANN are "
"installed correctly, refer to [link](https://ascend.github.io/docs/sources/"
"ascend/quick_install.html) for more details."
msgstr ""
"安装前,您需要确保固件/驱动和 CANN 已正确安装,更多详情请参考 "
"[昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"
"安装前,您需要确保固件/驱动和 CANN 已正确安装,更多详情请参考 [链接]"
"(https://ascend.github.io/docs/sources/ascend/quick_install.html)。"
#: ../../source/installation.md:29
#: ../../installation.md:26
msgid "Configure hardware environment"
msgstr "配置硬件环境"
#: ../../source/installation.md:31
#: ../../installation.md:28
msgid ""
"To verify that the Ascend NPU firmware and driver were correctly "
"installed, run:"
msgstr "要验证昇腾 NPU 固件和驱动程序是否正确安装,请运行:"
"To verify that the Ascend NPU firmware and driver were correctly installed, "
"run:"
msgstr "要验证 Ascend NPU 固件和驱动程序是否正确安装,请运行:"
#: ../../source/installation.md:37
#: ../../installation.md:34
msgid ""
"Refer to [Ascend Environment Setup "
"Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) "
"for more details."
"Refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/"
"sources/ascend/quick_install.html) for more details."
msgstr ""
"更多详情请参考 "
"[昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"
"更多详情请参考[Ascend环境搭建指南](https://ascend.github.io/docs/sources/"
"ascend/quick_install.html)。"
#: ../../source/installation.md:39
#: ../../installation.md:36
msgid "Configure software environment"
msgstr "配置软件环境"
#: ../../source/installation.md
#: ../../installation.md
msgid "Before using pip"
msgstr "使用 pip 前"
msgstr "使用 pip 前"
#: ../../source/installation.md:49
#: ../../installation.md:46
msgid ""
"The easiest way to prepare your software environment is using CANN image "
"directly:"
msgstr "准备软件环境最简单的方是直接使用 CANN 镜像:"
msgstr "最简单的方是直接使用 CANN 镜像来准备您的软件环境"
#: ../../source/installation.md:52
msgid ""
"The CANN prebuilt image includes NNAL (Ascend Neural Network Acceleration"
" Library), which provides libatb.so for advanced tensor operations. No "
"additional installation is required when using the prebuilt image."
msgstr "CANN 预构建镜像包含 NNAL昇腾神经网络加速库它提供了用于高级张量运算的 libatb.so。使用预构建镜像时无需额外安装。"
#: ../../source/installation.md
#: ../../installation.md
msgid "Click here to see \"Install CANN manually\""
msgstr "点击此处查看“手动安装 CANN”"
#: ../../source/installation.md:80
#: ../../installation.md:72
msgid "You can also install CANN manually:"
msgstr "也可以手动安装 CANN"
msgstr "也可以手动安装 CANN"
#: ../../source/installation.md:83
msgid ""
"If you encounter \"libatb.so not found\" errors during runtime, please "
"ensure NNAL is properly installed as shown in the manual installation "
"steps below."
msgstr "如果在运行时遇到“libatb.so not found”错误请确保 NNAL 已正确安装,如下方手动安装步骤所示。"
#: ../../source/installation.md
#: ../../installation.md
msgid "Before using docker"
msgstr "使用 docker 前"
msgstr "使用 docker 前"
#: ../../source/installation.md:115
#: ../../installation.md:104
msgid ""
"No extra steps are needed if you are using the `vllm-ascend` prebuilt "
"Docker image."
msgstr "如果您使用 `vllm-ascend` 预构建的 Docker 镜像,则无需额外步骤。"
"No more extra step if you are using `vllm-ascend` prebuilt docker image."
msgstr "如果你使用 `vllm-ascend` 预构建的 docker 镜像,就无需额外的步骤。"
#: ../../source/installation.md:119
msgid "Once this is done, you can start to set up `vllm` and `vllm-ascend`."
msgstr "完成此步骤后,您就可以开始置 `vllm` 和 `vllm-ascend`。"
#: ../../installation.md:108
msgid "Once it's done, you can start to set up `vllm` and `vllm-ascend`."
msgstr "完成后,你可以开始置 `vllm` 和 `vllm-ascend`。"
#: ../../source/installation.md:121
msgid "Set up using Python"
msgstr "使用 Python 设置"
#: ../../installation.md:110
msgid "Setup vllm and vllm-ascend"
msgstr "安装 vllm 和 vllm-ascend"
#: ../../source/installation.md:123
msgid "First, install system dependencies and configure the pip mirror:"
msgstr "首先,安装系统依赖项并配置 pip 镜像:"
#: ../../installation.md
msgid "Using pip"
msgstr "使用 pip"
#: ../../source/installation.md:135
#: ../../installation.md:121
msgid "First install system dependencies and config pip mirror:"
msgstr "首先安装系统依赖并配置 pip 镜像:"
#: ../../installation.md:133
msgid ""
"**[Optional]** Then configure the extra-index of `pip` if you are working"
" on an x86 machine or using torch-npu dev version:"
msgstr "**[可选]** 如果您在 x86 机器上工作或使用 torch-npu 开发版本,请配置 `pip` 的额外索引:"
"**[Optional]** Then config the extra-index of `pip` if you are working on a "
"x86 machine or using torch-npu dev version:"
msgstr ""
"**[可选]** 如果你在 x86 机器上工作或使用 torch-npu 开发版,请配置 `pip` 的额"
"外索引:"
#: ../../source/installation.md:142
msgid "Then you can install `vllm` and `vllm-ascend` from a **pre-built wheel**:"
msgstr "然后,您可以从 **预构建的 wheel 包** 安装 `vllm` `vllm-ascend`"
#: ../../installation.md:140
msgid ""
"Then you can install `vllm` and `vllm-ascend` from **pre-built wheel**:"
msgstr "然后你可以从**预编译的 wheel 包**安装 `vllm` 和 `vllm-ascend`"
#: ../../source/installation.md
#: ../../installation.md
msgid "Click here to see \"Build from source code\""
msgstr "点击此处查看“从源代码构建”"
#: ../../source/installation.md:155
#: ../../installation.md:153
msgid "or build from **source code**:"
msgstr "或从 **源代码** 构建:"
msgstr "或从**源代码**构建:"
#: ../../source/installation.md:174
#: ../../installation.md:171
msgid ""
"If you are building custom operators for Atlas A3, you should run `git "
"submodule update --init --recursive` manually, or ensure your environment"
" has internet access."
"vllm-ascend will build custom ops by default. If you don't want to build "
"it, set `COMPILE_CUSTOM_KERNELS=0` environment to disable it."
msgstr ""
"如果您正在为 Atlas A3 构建自定义算子,您应该手动运行 `git submodule update --init "
"--recursive`,或确保您的环境可以访问互联网。"
"vllm-ascend 默认会编译自定义算子。如果你不想编译它,可以设置环境变量 "
"`COMPILE_CUSTOM_KERNELS=0` 来禁用。"
#: ../../source/installation.md:178
#: ../../installation.md:175
msgid ""
"To build custom operators, gcc/g++ higher than 8 and C++17 or higher are "
"required. If you are using `pip install -e .` and encounter a torch-npu "
"version conflict, please install with `pip install --no-build-isolation "
"-e .` to build on system env. If you encounter other problems during "
"compiling, it is probably because an unexpected compiler is being used, "
"you may export `CXX_COMPILER` and `C_COMPILER` in the environment to "
"specify your g++ and gcc locations before compiling."
"If you are building from v0.7.3-dev and intend to use sleep mode feature, "
"you should set `COMPILE_CUSTOM_KERNELS=1` manually. To build custom ops, "
"gcc/g++ higher than 8 and c++ 17 or higher is required. If you're using "
"`pip install -e .` and encourage a torch-npu version conflict, please "
"install with `pip install --no-build-isolation -e .` to build on system "
"env. If you encounter other problems during compiling, it is probably "
"because unexpected compiler is being used, you may export `CXX_COMPILER` "
"and `C_COMPILER` in env to specify your g++ and gcc locations before "
"compiling."
msgstr ""
"构建自定义算子需要 gcc/g++ 版本高于 8 且支持 C++17 或更高标准。如果您使用 `pip install -e .` 并遇到 "
"torch-npu 版本冲突,请使用 `pip install --no-build-isolation "
"-e .` 在系统环境中进行安装。如果在编译过程中遇到其他问题,可能是因为使用了非预期的编译器,您可以在编译前通过环境变量导出 `CXX_COMPILER` "
"和 `C_COMPILER` 来指定您的 g++ 和 gcc 路径。"
"如果你是从 v0.7.3-dev 版本开始构建,并且打算使用休眠模式功能,你需要手动设"
"置 `COMPILE_CUSTOM_KERNELS=1`。构建自定义算子时,要求 gcc/g++ 版本高于 8 且"
"支持 c++ 17 或更高标准。如果你正在使用 `pip install -e .` 并且出现了 torch-"
"npu 版本冲突,请使用 `pip install --no-build-isolation -e .` 在系统环境下进"
"行安装。如果在编译过程中遇到其它问题,可能是因为使用了非预期的编译器,你可以"
"在编译前通过环境变量导出 `CXX_COMPILER` 和 `C_COMPILER`,以指定你的 g++ 和 "
"gcc 路径。"
#: ../../source/installation.md:181
msgid ""
"If you are building in a CPU-only environment where `npu-smi` is "
"unavailable, you need to set `SOC_VERSION` before `pip install -e .` so "
"the build can target the correct chip. You can refer to `Dockerfile*` "
"defaults, for example:"
msgstr ""
"如果您在仅 CPU 的环境中构建,且 `npu-smi` 不可用,则需要在 `pip install -e .` 之前设置 "
"`SOC_VERSION`,以便构建过程能针对正确的芯片。您可以参考 `Dockerfile*` 的默认值,例如:"
#: ../../installation.md
msgid "Using docker"
msgstr "使用 docker"
#: ../../source/installation.md:183
msgid "Atlas A2: `export SOC_VERSION=ascend910b1`"
msgstr "Atlas A2`export SOC_VERSION=ascend910b1`"
#: ../../installation.md:184
msgid "You can just pull the **prebuilt image** and run it with bash."
msgstr "你可以直接拉取**预构建镜像**并用 bash 运行它。"
#: ../../source/installation.md:184
msgid "Atlas A3: `export SOC_VERSION=ascend910_9391`"
msgstr "Atlas A3`export SOC_VERSION=ascend910_9391`"
#: ../../source/installation.md:185
msgid "Atlas 300I: `export SOC_VERSION=ascend310p1`"
msgstr "Atlas 300I`export SOC_VERSION=ascend310p1`"
#: ../../source/installation.md:186
msgid "Atlas A5: `export SOC_VERSION=<value starting with \"ascend950\">`"
msgstr "Atlas A5`export SOC_VERSION=<以 \"ascend950\" 开头的值>`"
#: ../../source/installation.md:189
msgid "Set up using Docker"
msgstr "使用 Docker 设置"
#: ../../source/installation.md:191
msgid ""
"`vllm-ascend` offers Docker images for deployment. You can just pull the "
"**prebuilt image** from the image repository [ascend/vllm-"
"ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and run "
"it with bash."
msgstr ""
"`vllm-ascend` 提供用于部署的 Docker 镜像。您可以直接从镜像仓库 [ascend/vllm-"
"ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) 拉取 "
"**预构建镜像** 并使用 bash 运行。"
#: ../../source/installation.md:193
msgid "Supported images as following."
msgstr "支持的镜像如下。"
#: ../../source/installation.md:177
msgid "image name"
msgstr "镜像名称"
#: ../../source/installation.md:177
msgid "Hardware"
msgstr "硬件"
#: ../../source/installation.md:177
msgid "OS"
msgstr "操作系统"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}"
msgstr "vllm-ascend:{{ vllm_ascend_version }}"
#: ../../source/installation.md:177
msgid "Atlas A2"
msgstr "Atlas A2"
#: ../../source/installation.md:177
msgid "Ubuntu"
msgstr "Ubuntu"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}-openeuler"
msgstr "vllm-ascend:{{ vllm_ascend_version }}-openeuler"
#: ../../source/installation.md:177
msgid "openEuler"
msgstr "openEuler"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}-a3"
msgstr "vllm-ascend:{{ vllm_ascend_version }}-a3"
#: ../../source/installation.md:177
msgid "Atlas A3"
msgstr "Atlas A3"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}-a3-openeuler"
msgstr "vllm-ascend:{{ vllm_ascend_version }}-a3-openeuler"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}-310p"
msgstr "vllm-ascend:{{ vllm_ascend_version }}-310p"
#: ../../source/installation.md:177
msgid "Atlas 300I"
msgstr "Atlas 300I"
#: ../../source/installation.md:177
msgid "vllm-ascend:{{ vllm_ascend_version }}-310p-openeuler"
msgstr "vllm-ascend:{{ vllm_ascend_version }}-310p-openeuler"
#: ../../source/installation.md
#: ../../installation.md
msgid "Click here to see \"Build from Dockerfile\""
msgstr "点击这里查看“从 Dockerfile 构建”"
#: ../../source/installation.md:205
#: ../../installation.md:187
msgid "or build IMAGE from **source code**:"
msgstr "或从**源代码**构建 IMAGE"
#: ../../source/installation.md:247
#: ../../installation.md:218
msgid ""
"The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed"
" in `/vllm-workspace` and installed in [development "
"mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)"
" (`pip install -e`) to help developers immediately make changes without "
"requiring a new installation."
"The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed "
"in `/vllm-workspace` and installed in [development mode](https://setuptools."
"pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to "
"help developer immediately take place changes without requiring a new "
"installation."
msgstr ""
"默认工作目录 `/workspace`vLLM 和 vLLM Ascend 代码位于 `/vllm-workspace` "
"目录下,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)`pip"
" install -e`)安装,以便开发者能够即时应用更改,而无需重新安装。"
"默认工作目录 `/workspace`vLLM 和 vLLM Ascend 代码被放置在 `/vllm-"
"workspace`,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/"
"development_mode.html)`pip install -e`)安装,以便开发者能够即时生效更改,"
"而无需重新安装。"
#: ../../source/installation.md:249
#: ../../installation.md:222
msgid "Extra information"
msgstr "额外信息"
#: ../../source/installation.md:251
#: ../../installation.md:224
msgid "Verify installation"
msgstr "验证安装"
#: ../../source/installation.md:253
#: ../../installation.md:226
msgid "Create and run a simple inference test. The `example.py` can be like:"
msgstr "创建并运行一个简单的推理测试。`example.py` 内容示例如下:"
msgstr "创建并运行一个简单的推理测试。`example.py` 可以如下:"
#: ../../source/installation.md:278
#: ../../installation.md:251
msgid "Then run:"
msgstr "然后运行:"
#: ../../source/installation.md:284
msgid ""
"If you encounter a connection error with Hugging Face (e.g., `We couldn't"
" connect to 'https://huggingface.co' to load the files, and couldn't find"
" them in the cached files.`), run the following commands to use "
"ModelScope as an alternative:"
msgstr ""
"如果遇到 Hugging Face 连接错误(例如:`We couldn't connect to "
"'https://huggingface.co' to load the files, and couldn't find them in the"
" cached files.`),请运行以下命令以使用 ModelScope 作为替代方案:"
#: ../../source/installation.md:292
#: ../../installation.md:259
msgid "The output will be like:"
msgstr "输出示例如下"
#: ../../source/installation.md:316
msgid "Multi-node Deployment"
msgstr "多节点部署"
#: ../../source/installation.md:318
msgid "Verify Multi-Node Communication"
msgstr "验证多节点通信"
#: ../../source/installation.md:320
msgid ""
"First, check physical layer connectivity, then verify each node, and "
"finally verify the inter-node connectivity."
msgstr "首先,检查物理层连通性,然后验证每个节点,最后验证节点间连通性。"
#: ../../source/installation.md:322
msgid "Physical Layer Requirements"
msgstr "物理层要求"
#: ../../source/installation.md:324
msgid ""
"The physical machines must be located on the same WLAN, with network "
"connectivity."
msgstr "物理机必须位于同一无线局域网WLAN并具备网络连通性。"
#: ../../source/installation.md:325
msgid ""
"All NPUs are connected with optical modules, and the connection status "
"must be normal."
msgstr "所有 NPU 均通过光模块连接,且连接状态必须正常。"
#: ../../source/installation.md:327
msgid "Each Node Verification"
msgstr "单节点验证"
#: ../../source/installation.md:329
msgid ""
"Execute the following commands on each node in sequence. The results must"
" all be `success` and the status must be `UP`:"
msgstr "在每个节点上依次执行以下命令。所有结果必须为 `success`,状态必须为 `UP`"
#: ../../source/installation.md
msgid "A2 series"
msgstr "A2 系列"
#: ../../source/installation.md
msgid "A3 series"
msgstr "A3 系列"
#: ../../source/installation.md:374
msgid "Interconnect Verification"
msgstr "互连验证"
#: ../../source/installation.md:376
msgid "1. Get NPU IP Addresses"
msgstr "1.获取 NPU IP 地址"
#: ../../source/installation.md:399
msgid "2. Cross-Node PING Test"
msgstr "2.跨节点 PING 测试"
#: ../../source/installation.md:406
msgid "Run Container In Each Node"
msgstr "在每个节点中运行容器"
#: ../../source/installation.md:408
msgid ""
"Using vLLM-ascend official container is more efficient to run multi-node "
"environment."
msgstr "使用 vLLM-ascend 官方容器运行多节点环境更为高效。"
#: ../../source/installation.md:410
msgid ""
"Run the following command to start the container in each node (You should"
" download the weight to /root/.cache in advance):"
msgstr "在每个节点中运行以下命令以启动容器(您应提前将权重下载到 /root/.cache 目录):"
#~ msgid ">= 8.1.RC1"
#~ msgstr ">= 8.1.RC1"
#~ msgid ">= 2.5.1.post1.dev20250619"
#~ msgstr ">= 2.5.1.post1.dev20250619"
#~ msgid "You have 2 way to install:"
#~ msgstr "您有两种安装方式:"
#~ msgid "Setup vllm and vllm-ascend"
#~ msgstr "安装 vllm 和 vllm-ascend"
#~ msgid "Using pip"
#~ msgstr "使用 pip"
#~ msgid ""
#~ "vllm-ascend will build custom ops "
#~ "by default. If you don't want to"
#~ " build it, set `COMPILE_CUSTOM_KERNELS=0` "
#~ "environment to disable it."
#~ msgstr ""
#~ "vllm-ascend 默认会编译自定义算子。如果您不想编译它,可以设置环境变量 "
#~ "`COMPILE_CUSTOM_KERNELS=0` 来禁用。"
#~ msgid "You can just pull the **prebuilt image** and run it with bash."
#~ msgstr "您可以直接拉取**预构建镜像**并用 bash 运行它。"
msgstr "输出将会像这样"

View File

@@ -4,150 +4,146 @@
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2025-07-18 09:01+0800\n"
"PO-Revision-Date: 2025-07-18 10:09+0800\n"
"Last-Translator: \n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language: zh_CN\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
"X-Generator: Poedit 3.5\n"
#: ../../source/quick_start.md:1
#: ../../quick_start.md:1
msgid "Quickstart"
msgstr "快速入门"
#: ../../source/quick_start.md:3
#: ../../quick_start.md:3
msgid "Prerequisites"
msgstr "先决条件"
#: ../../source/quick_start.md:5
#: ../../quick_start.md:5
msgid "Supported Devices"
msgstr "支持的设备"
#: ../../source/quick_start.md:7
#: ../../quick_start.md:6
msgid ""
"Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 "
"Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 "
"Box16, Atlas 300T A2)"
msgstr ""
"Atlas A2 训练系列Atlas 800T A2Atlas 900 A2 PoDAtlas 200T A2 Box16、Atlas "
"300T A2"
"Atlas A2 训练系列Atlas 800T A2Atlas 900 A2 PoDAtlas 200T A2 Box16"
"Atlas 300T A2"
#: ../../source/quick_start.md:8
msgid "Atlas 800I A2 inference series (Atlas 800I A2)"
#: ../../quick_start.md:7
msgid "Atlas 800I A2 Inference series (Atlas 800I A2)"
msgstr "Atlas 800I A2 推理系列Atlas 800I A2"
#: ../../source/quick_start.md:9
msgid ""
"Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas "
"9000 A3 SuperPoD)"
msgstr ""
"Atlas A3 训练系列Atlas 800T A3、Atlas 900 A3 SuperPoD、Atlas 9000 A3 SuperPoD"
#: ../../source/quick_start.md:10
msgid "Atlas 800I A3 inference series (Atlas 800I A3)"
msgstr "Atlas 800I A3 推理系列Atlas 800I A3"
#: ../../source/quick_start.md:11
msgid "[Experimental] Atlas 300I inference series (Atlas 300I Duo)"
msgstr "[实验性] Atlas 300I 推理系列Atlas 300I Duo"
#: ../../source/quick_start.md:13
#: ../../quick_start.md:9
msgid "Setup environment using container"
msgstr "使用容器设置环境"
#: ../../source/quick_start.md
#: ../../quick_start.md
msgid "Ubuntu"
msgstr "Ubuntu"
#: ../../source/quick_start.md
#: ../../quick_start.md
msgid "openEuler"
msgstr "openEuler"
#: ../../source/quick_start.md:85
#: ../../quick_start.md:69
msgid ""
"The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed"
" in `/vllm-workspace` and installed in [development "
"mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)"
" (`pip install -e`) to help developers make changes effective immediately"
" without requiring a new installation."
"The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed "
"in `/vllm-workspace` and installed in [development mode](https://setuptools."
"pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to "
"help developer immediately take place changes without requiring a new "
"installation."
msgstr ""
"默认工作目录 `/workspace`vLLM 和 vLLM Ascend 代码位于 `/vllm-workspace` 目录下,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)`pip install -e`)安装,以便开发者能够即时生效更改,而无需重新安装。"
"默认工作目录 `/workspace`vLLM 和 vLLM Ascend 代码被放置在 `/vllm-"
"workspace`,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/"
"development_mode.html)`pip install -e`)安装,以便开发者能够即时生效更改,"
"而无需重新安装。"
#: ../../source/quick_start.md:87
#: ../../quick_start.md:71
msgid "Usage"
msgstr "用法"
#: ../../source/quick_start.md:89
msgid "You can use ModelScope mirror to speed up download:"
msgstr "可以使用 ModelScope 镜像来加速下载:"
#: ../../quick_start.md:73
msgid "You can use Modelscope mirror to speed up download:"
msgstr "可以使用 Modelscope 镜像来加速下载:"
#: ../../source/quick_start.md:97
#: ../../quick_start.md:80
msgid "There are two ways to start vLLM on Ascend NPU:"
msgstr "在昇腾 NPU 上启动 vLLM 有两种方式:"
#: ../../source/quick_start.md
#: ../../quick_start.md
msgid "Offline Batched Inference"
msgstr "离线批量推理"
#: ../../source/quick_start.md:103
#: ../../quick_start.md:86
msgid ""
"With vLLM installed, you can start generating texts for list of input "
"prompts (i.e. offline batch inference)."
msgstr "安装 vLLM 后,您可以开始为一系列输入提示生成文本(即离线批量推理)。"
"prompts (i.e. offline batch inferencing)."
msgstr ""
"安装了 vLLM 后,您可以开始为一系列输入提示生成文本(即离线批量推理)。"
#: ../../source/quick_start.md:105
#: ../../quick_start.md:88
msgid ""
"Try to run below Python script directly or use `python3` shell to "
"generate texts:"
msgstr "尝试直接运行下面的 Python 脚本,或者使用 `python3` 交互式环境来生成文本:"
"Try to run below Python script directly or use `python3` shell to generate "
"texts:"
msgstr ""
"尝试直接运行下面的 Python 脚本,或者使用 `python3` 交互式命令行来生成文本:"
#: ../../source/quick_start.md
#: ../../quick_start.md
msgid "OpenAI Completions API"
msgstr "OpenAI Completions API"
#: ../../source/quick_start.md:132
#: ../../quick_start.md:114
msgid ""
"vLLM can also be deployed as a server that implements the OpenAI API "
"protocol. Run the following command to start the vLLM server with the "
"[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) model:"
"protocol. Run the following command to start the vLLM server with the [Qwen/"
"Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) "
"model:"
msgstr ""
"vLLM 也可以部署为实现 OpenAI API 协议的服务器。运行以下命令,使用 [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) 模型启动 vLLM 服务器:"
"vLLM 也可以为实现 OpenAI API 协议的服务器进行部署。运行以下命令,使用 "
"[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-"
"Instruct) 模型启动 vLLM 服务器:"
#: ../../source/quick_start.md:143
msgid "If you see a log as below:"
msgstr "如果看到如下日志:"
#: ../../quick_start.md:124
msgid "If you see log as below:"
msgstr "如果看到如下日志:"
#: ../../source/quick_start.md:152
#: ../../quick_start.md:132
msgid "Congratulations, you have successfully started the vLLM server!"
msgstr "恭喜,您已成功启动 vLLM 服务器!"
msgstr "恭喜,你已经成功启动 vLLM 服务器!"
#: ../../source/quick_start.md:154
msgid "You can query the list of models:"
msgstr "可以查询模型列表:"
#: ../../quick_start.md:134
msgid "You can query the list the models:"
msgstr "可以查询模型列表:"
#: ../../source/quick_start.md:162
#: ../../quick_start.md:141
msgid "You can also query the model with input prompts:"
msgstr "也可以通过输入提示来查询模型:"
msgstr "也可以通过输入提示来查询模型:"
#: ../../source/quick_start.md:177
#: ../../quick_start.md:155
msgid ""
"vLLM is serving as a background process, you can use `kill -2 $VLLM_PID` "
"to stop the background process gracefully, which is similar to `Ctrl-C` "
"for stopping the foreground vLLM process:"
"vLLM is serving as background process, you can use `kill -2 $VLLM_PID` to "
"stop the background process gracefully, it's equal to `Ctrl-C` to stop "
"foreground vLLM process:"
msgstr ""
"vLLM 正作为后台进程运行,可以使用 `kill -2 $VLLM_PID` 来优雅地停止后台进程,这类似于使用 `Ctrl-C` 停止前台 vLLM 进程:"
"vLLM 正作为后台进程运行,可以使用 `kill -2 $VLLM_PID` 来优雅地停止后台进"
"程,这等同于使用 `Ctrl-C` 停止前台 vLLM 进程:"
#: ../../source/quick_start.md:186
msgid "The output is as below:"
msgstr "输出如下"
#: ../../quick_start.md:164
msgid "You will see output as below:"
msgstr "你将会看到如下输出:"
#: ../../source/quick_start.md:195
msgid "Finally, you can exit the container by using `ctrl-D`."
msgstr "最后,可以通过按 `ctrl-D` 退出容器。"
#: ../../quick_start.md:172
msgid "Finally, you can exit container by using `ctrl-D`."
msgstr "最后,可以通过按 `ctrl-D` 退出容器。"

View File

@@ -1,29 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/index.md:1
#: ../../source/tutorials/features/index.md:5
msgid "Feature Tutorials"
msgstr "功能教程"
#: ../../source/tutorials/features/index.md:3
msgid "This section provides tutorials for different features of vLLM Ascend."
msgstr "本节提供 vLLM Ascend 不同功能的使用教程。"

View File

@@ -1,485 +0,0 @@
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:1
msgid "Long-Sequence Context Parallel (Deepseek)"
msgstr "长序列上下文并行 (Deepseek)"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:3
msgid "Getting Started"
msgstr "快速开始"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:6
msgid ""
"Context parallel feature currently is only supported on Atlas A3 device, "
"and will be supported on Atlas A2 in the future."
msgstr "上下文并行特性目前仅在 Atlas A3 设备上受支持,未来将在 Atlas A2 上提供支持。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:9
msgid ""
"vLLM-Ascend now supports long sequence with context parallel options. "
"This guide takes one-by-one steps to verify these features with "
"constrained resources."
msgstr "vLLM-Ascend 现已支持长序列上下文并行选项。本指南将逐步引导您在有限资源下验证这些功能。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:11
msgid ""
"Take the Deepseek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 "
"servers to deploy the “1P1D” architecture. Node p is deployed across "
"multiple machines, while node d is deployed on a single machine. Assume "
"the IP of the prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 "
"(prefill 2), and the decoder servers are 192.0.0.3 (decoder 1). On each "
"server, use 8 NPUs 16 chips to deploy one service instance. In the "
"current example, we will enable the context parallel feature on node p to"
" improve TTFT. Although enabling the DCP feature on node d can reduce "
"memory usage, it would introduce additional communication and small "
"operator overhead. Therefore, we will not enable the DCP feature on node "
"d."
msgstr ""
"以 Deepseek-V3.1-w8a8 模型为例,使用 3 台 Atlas 800T A3 服务器部署“1P1D”架构。节点 p "
"跨多台机器部署,而节点 d 部署在单台机器上。假设预填充服务器的 IP 为 192.0.0.1(预填充 1和 192.0.0.2(预填充 "
"2解码器服务器为 192.0.0.3(解码器 1。每台服务器使用 8 个 NPU16 个芯片)部署一个服务实例。在当前示例中,我们将在节点"
" p 上启用上下文并行特性以改善 TTFT。虽然在节点 d 上启用 DCP "
"特性可以减少内存使用,但会引入额外的通信和小算子开销。因此,我们不会在节点 d 上启用 DCP 特性。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:13
msgid "Environment Preparation"
msgstr "环境准备"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:15
msgid "Model Weight"
msgstr "模型权重"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:17
msgid ""
"`DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download "
"model weight](https://www.modelscope.cn/models/Eco-"
"Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to "
"`bfloat16` in `config.json`."
msgstr ""
"`DeepSeek-V3.1_w8a8mix_mtp`(混合 MTP "
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-"
"Tech/DeepSeek-V3.1-w8a8)。请在 `config.json` 中将 `torch_dtype` 从 `float16` "
"修改为 `bfloat16`。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:19
msgid ""
"It is recommended to download the model weight to the shared directory of"
" multiple nodes, such as `/root/.cache/`"
msgstr "建议将模型权重下载到多个节点的共享目录中,例如 `/root/.cache/`"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:21
msgid "Verify Multi-node Communication"
msgstr "验证多节点通信"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:23
msgid ""
"Refer to [verify multi-node communication "
"environment](../../installation.md#verify-multi-node-communication) to "
"verify multi-node communication."
msgstr ""
"请参考[验证多节点通信环境](../../installation.md#verify-multi-node-"
"communication)来验证多节点通信。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:25
msgid "Installation"
msgstr "安装"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:27
msgid "You can use our official Docker image to run `DeepSeek-V3.1` directly."
msgstr "您可以使用我们的官方 Docker 镜像直接运行 `DeepSeek-V3.1`。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:29
msgid ""
"Select an image based on your machine type and start the Docker image on "
"your node, refer to [using Docker](../../installation.md#set-up-using-"
"docker)."
msgstr ""
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-"
"up-using-docker)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:64
msgid "You need to set up environment on each node."
msgstr "您需要在每个节点上设置环境。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:66
msgid "Prefiller/Decoder Deployment"
msgstr "预填充器/解码器部署"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:68
msgid ""
"We can run the following scripts to launch a server on the "
"prefiller/decoder node, respectively. Please note that each P/D node will"
" occupy ports ranging from kv_port to kv_port + num_chips to initialize "
"socket listeners. To avoid any issues, port conflicts should be "
"prevented. Additionally, ensure that each node's engine_id is uniquely "
"assigned to avoid conflicts."
msgstr ""
"我们可以分别在预填充器/解码器节点上运行以下脚本来启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + "
"num_chips 的端口范围来初始化 socket 监听器。为避免任何问题,应防止端口冲突。此外,请确保每个节点的 engine_id "
"被唯一分配以避免冲突。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:70
msgid ""
"Run the following script to execute online 128k inference on three nodes "
"respectively."
msgstr "运行以下脚本,分别在三个节点上执行在线 128k 推理。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md
msgid "Prefiller node 1"
msgstr "预填充节点 1"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md
msgid "Prefiller node 2"
msgstr "预填充节点 2"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md
msgid "Decoder node 1"
msgstr "解码节点 1"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:276
msgid "Prefill master node `proxy.sh` script"
msgstr "预填充主节点 `proxy.sh` 脚本"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:292
msgid "Run proxy"
msgstr "运行代理"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:294
msgid ""
"Run a proxy server on the same node with the prefiller service instance. "
"You can get the proxy program in the repository's examples: "
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"在与预填充服务实例相同的节点上运行代理服务器。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:301
msgid "**Notice:** The parameters are explained as follows:"
msgstr "**注意:** 参数解释如下:"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:304
msgid ""
"`--tensor-parallel-size` 16 are common settings for tensor parallelism "
"(TP) sizes."
msgstr "`--tensor-parallel-size` 16 是张量并行TP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:305
msgid ""
"`--prefill-context-parallel-size` 2 is common setting for prefill context"
" parallelism (PCP) sizes."
msgstr "`--prefill-context-parallel-size` 2 是预填充上下文并行PCP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:306
msgid ""
"`--decode-context-parallel-size` 8 are common settings for decode context"
" parallelism (DCP) sizes."
msgstr "`--decode-context-parallel-size` 8 是解码上下文并行DCP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:307
msgid ""
"`--max-model-len` represents the context length, which is the maximum "
"value of the input plus output for a single request."
msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:308
msgid ""
"`--max-num-seqs` indicates the maximum number of requests that each DP "
"group is allowed to process. If the number of requests sent to the "
"service exceeds this limit, the excess requests will remain in a waiting "
"state and will not be scheduled. Note that the time spent in the waiting "
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
"testing performance, it is generally recommended that `--max-num-seqs` * "
"`--data-parallel-size` >= the actual total concurrency."
msgstr ""
"`--max-num-seqs` 表示每个 DP "
"组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 "
"TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` "
">= 实际总并发数。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:309
msgid ""
"`--max-num-batched-tokens` represents the maximum number of tokens that "
"the model can process in a single step. Currently, vLLM v1 scheduling "
"enables ChunkPrefill/SplitFuse by default, which means:"
msgstr ""
"`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前vLLM v1 调度默认启用 "
"ChunkPrefill/SplitFuse这意味着"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:310
msgid ""
"(1) If the input length of a request is greater than `--max-num-batched-"
"tokens`, it will be divided into multiple rounds of computation according"
" to `--max-num-batched-tokens`;"
msgstr ""
"1如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens`"
" 被分成多轮计算;"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:311
msgid ""
"(2) Decode requests are prioritized for scheduling, and prefill requests "
"are scheduled only if there is available capacity."
msgstr "2解码请求优先调度预填充请求仅在有空闲容量时才会被调度。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:312
msgid ""
"Generally, if `--max-num-batched-tokens` is set to a larger value, the "
"overall latency will be lower, but the pressure on GPU memory (activation"
" value usage) will be greater."
msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:313
msgid ""
"`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
"will use for actual inference. Its essential function is to calculate the"
" available kv_cache size. During the warm-up phase (referred to as "
"profile run in vLLM), vLLM records the peak GPU memory usage during an "
"inference process with an input size of `--max-num-batched-tokens`. The "
"available kv_cache size is then calculated as: `--gpu-memory-utilization`"
" * HBM size - peak GPU memory usage. Therefore, the larger the value of "
"`--gpu-memory-utilization`, the more kv_cache can be used. However, since"
" the GPU memory usage during the warm-up phase may differ from that "
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
"during actual inference. The default value is `0.9`."
msgstr ""
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
"大小。在预热阶段vLLM 中称为 profile runvLLM 会记录输入大小为 `--max-num-batched-tokens` "
"的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache "
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
"utilization` 设置得过高可能导致实际推理时出现 OOM内存不足问题。默认值为 `0.9`。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:314
msgid ""
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
"does not support a mixed approach of ETP and EP; that is, MoE can either "
"use pure EP or pure TP."
msgstr ""
"`--enable-expert-parallel` 表示启用了 EP。请注意vLLM 不支持 ETP 和 EP 的混合方法也就是说MoE "
"只能使用纯 EP 或纯 TP。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:315
msgid ""
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
"To enable it, remove this option."
msgstr "`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:316
msgid ""
"`--quantization` \"ascend\" indicates that quantization is used. To "
"disable quantization, remove this option."
msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:317
msgid ""
"`--compilation-config` contains configurations related to the aclgraph "
"graph mode. The most significant configurations are \"cudagraph_mode\" "
"and \"cudagraph_capture_sizes\", which have the following meanings: "
"\"cudagraph_mode\": represents the specific graph mode. Currently, "
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
"mainly used to reduce the cost of operator dispatch. Currently, "
"\"FULL_DECODE_ONLY\" is recommended."
msgstr ""
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 "
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
"\"FULL_DECODE_ONLY\"。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:319
msgid ""
"\"cudagraph_capture_sizes\": represents different levels of graph modes. "
"The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
"In the graph mode, the input for graphs at different levels is fixed, and"
" inputs between levels are automatically padded to the next level. "
"Currently, the default setting is recommended. Only in some scenarios is "
"it necessary to set this separately to achieve optimal performance."
msgstr ""
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
"40,..., `--max-num-"
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:320
msgid ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
"optimization is enabled. Currently, this optimization is only supported "
"for MoE in scenarios where tensor-parallel-size > 1."
msgstr ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 "
"tensor-parallel-size > 1 的场景下对 MoE 提供支持。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:321
msgid ""
"`export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context "
"parallel is enabled. This environment variable is required in the PD "
"architecture but not needed in the PD co-locate deployment scenario. It "
"will be removed in the future."
msgstr ""
"`export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` 表示启用了上下文并行。此环境变量在 PD "
"架构中是必需的,但在 PD 共置部署场景中不需要。未来将被移除。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:323
msgid "**Notice:**"
msgstr "**注意:**"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:325
msgid ""
"tensor-parallel-size needs to be divisible by decode-context-parallel-"
"size."
msgstr "tensor-parallel-size 需要能被 decode-context-parallel-size 整除。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:326
msgid ""
"decode-context-parallel-size must be less than or equal to tensor-"
"parallel-size."
msgstr "decode-context-parallel-size 必须小于或等于 tensor-parallel-size。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:328
msgid "Accuracy Evaluation"
msgstr "精度评估"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:330
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:342
msgid "Using AISBench"
msgstr "使用 AISBench"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:332
msgid ""
"Refer to [Using "
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
"details."
msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:334
msgid ""
"After execution, you can get the result, here is the result of "
"`DeepSeek-V3.1-w8a8` for reference only."
msgstr "执行后,您可以获得结果,以下是 `DeepSeek-V3.1-w8a8` 的结果,仅供参考。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "dataset"
msgstr "数据集"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "version"
msgstr "版本"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "metric"
msgstr "指标"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "mode"
msgstr "模式"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "vllm-api-general-chat"
msgstr "vllm-api-general-chat"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "aime2024"
msgstr "aime2024"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "-"
msgstr "-"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "accuracy"
msgstr "准确率"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "gen"
msgstr "生成"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "86.67"
msgstr "86.67"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:340
msgid "Performance"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:344
msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr ""
"详情请参阅[使用 AISBench "
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:346
msgid "Using vLLM Benchmark"
msgstr "使用 vLLM 基准测试"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:348
msgid "Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example."
msgstr "以运行 `DeepSeek-V3.1-w8a8` 的性能评估为例。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:350
msgid ""
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
"for more details."
msgstr "更多详情请参阅 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:352
msgid "There are three `vllm bench` subcommands:"
msgstr "`vllm bench` 包含三个子命令:"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:354
msgid "`latency`: Benchmark the latency of a single batch of requests."
msgstr "`latency`:对单批请求的延迟进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:355
msgid "`serve`: Benchmark the online serving throughput."
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:356
msgid "`throughput`: Benchmark offline inference throughput."
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:358
msgid "Take the `serve` as an example. Run the code as follows."
msgstr "以 `serve` 为例,按如下方式运行代码。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:365
msgid ""
"After about several minutes, you can get the performance evaluation "
"result."
msgstr "大约几分钟后,您将获得性能评估结果。"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "ttft"
msgstr "首字元延迟"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "random"
msgstr "随机"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "performance"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "perf"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:211
msgid "20.7s"
msgstr "20.7秒"

View File

@@ -1,425 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:1
msgid "Long-Sequence Context Parallel (Qwen3-235B-A22B)"
msgstr "长序列上下文并行 (Qwen3-235B-A22B)"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:3
msgid "Getting Started"
msgstr "快速开始"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:5
msgid ""
"vLLM-Ascend now supports long-sequence context parallel. This guide takes"
" one-by-one steps to verify these features with constrained resources."
msgstr "vLLM-Ascend 现已支持长序列上下文并行。本指南将引导您在使用有限资源的情况下,逐步验证这些功能。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:7
msgid ""
"Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example,"
" use 1 Atlas 800 A3 (64G × 16) server to deploy the single node \"pd co-"
"locate\" architecture."
msgstr ""
"以 `Qwen3-235B-A22B-w8a8`(量化版本)模型为例,使用 1 台 Atlas 800 A364G × 16服务器部署单节点 "
"\"pd co-locate\" 架构。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:9
msgid "Environment Preparation"
msgstr "环境准备"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:11
msgid "Model Weight"
msgstr "模型权重"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:13
msgid ""
"`Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G "
"× 16) node. [Download model weight](https://modelscope.cn/models/vllm-"
"ascend/Qwen3-235B-A22B-W8A8)"
msgstr ""
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A364G × "
"16节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-"
"W8A8)"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:15
msgid ""
"It is recommended to download the model weight to the shared directory of"
" multiple nodes, such as `/root/.cache/`"
msgstr "建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:17
msgid "Run with Docker"
msgstr "使用 Docker 运行"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:19
msgid "Start a Docker container on each node."
msgstr "在每个节点上启动一个 Docker 容器。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "dataset"
msgstr "数据集"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "version"
msgstr "版本"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "metric"
msgstr "指标"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "mode"
msgstr "模式"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "vllm-api-general-chat"
msgstr "vllm-api-general-chat"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "aime2024"
msgstr "aime2024"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "-"
msgstr "-"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "accuracy"
msgstr "准确率"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "gen"
msgstr "生成"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:63
msgid "Deployment"
msgstr "部署"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:65
msgid "Single-node Deployment"
msgstr "单节点部署"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:67
msgid ""
"`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A364G*16. "
"Quantized version needs to start with parameter `--quantization ascend`."
msgstr ""
"`Qwen3-235B-A22B-w8a8` 可以部署在 1 台 Atlas 800 A364G*16上。量化版本需要使用参数 "
"`--quantization ascend` 启动。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:70
msgid "Run the following script to execute online 128k inference."
msgstr "运行以下脚本以执行在线 128k 推理。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:106
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:131
msgid "**Notice:**"
msgstr "**注意:**"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:108
#, python-brace-format
msgid ""
"for vllm version below `v0.12.0` use parameter: `--rope-scaling "
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
" \\`"
msgstr ""
"对于 vllm 版本低于 `v0.12.0`,使用参数:`--rope-scaling "
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
" \\`"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:109
#, python-brace-format
msgid ""
"for vllm version same as or newer than `v0.12.0` use parameter: `--hf-overrides "
"'{\"rope_parameters\": "
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
" \\`"
msgstr ""
"对于 vllm 版本 `v0.12.0`及以上,使用参数:`--hf-overrides '{\"rope_parameters\": "
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
" \\`"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:111
msgid "The parameters are explained as follows:"
msgstr "参数解释如下:"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:113
msgid ""
"`--tensor-parallel-size` 8 are common settings for tensor parallelism "
"(TP) sizes."
msgstr "`--tensor-parallel-size` 8 是张量并行TP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:114
msgid ""
"`--prefill-context-parallel-size` 2 are common settings for prefill "
"context parallelism (PCP) sizes."
msgstr "`--prefill-context-parallel-size` 2 是预填充上下文并行PCP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:115
msgid ""
"`--decode-context-parallel-size` 2 are common settings for decode context"
" parallelism (DCP) sizes."
msgstr "`--decode-context-parallel-size` 2 是解码上下文并行DCP大小的常见设置。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:116
msgid ""
"`--max-model-len` represents the context length, which is the maximum "
"value of the input plus output for a single request."
msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:117
msgid ""
"`--max-num-seqs` indicates the maximum number of requests that each DP "
"group is allowed to process. If the number of requests sent to the "
"service exceeds this limit, the excess requests will remain in a waiting "
"state and will not be scheduled. Note that the time spent in the waiting "
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
"testing performance, it is generally recommended that `--max-num-seqs` * "
"`--data-parallel-size` >= the actual total concurrency."
msgstr ""
"`--max-num-seqs` 表示每个 DP "
"组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 "
"TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` "
">= 实际总并发数。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:118
msgid ""
"`--max-num-batched-tokens` represents the maximum number of tokens that "
"the model can process in a single step. Currently, vLLM v1 scheduling "
"enables ChunkPrefill/SplitFuse by default, which means:"
msgstr ""
"`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前vLLM v1 调度默认启用 "
"ChunkPrefill/SplitFuse这意味着"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:119
msgid ""
"(1) If the input length of a request is greater than `--max-num-batched-"
"tokens`, it will be divided into multiple rounds of computation according"
" to `--max-num-batched-tokens`;"
msgstr ""
"1如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens`"
" 被分成多轮计算;"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:120
msgid ""
"(2) Decode requests are prioritized for scheduling, and prefill requests "
"are scheduled only if there is available capacity."
msgstr "2解码请求优先调度预填充请求仅在有空闲容量时才会被调度。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:121
msgid ""
"Generally, if `--max-num-batched-tokens` is set to a larger value, the "
"overall latency will be lower, but the pressure on GPU memory (activation"
" value usage) will be greater."
msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:122
msgid ""
"`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
"will use for actual inference. Its essential function is to calculate the"
" available kv_cache size. During the warm-up phase (referred to as "
"profile run in vLLM), vLLM records the peak GPU memory usage during an "
"inference process with an input size of `--max-num-batched-tokens`. The "
"available kv_cache size is then calculated as: `--gpu-memory-utilization`"
" * HBM size - peak GPU memory usage. Therefore, the larger the value of "
"`--gpu-memory-utilization`, the more kv_cache can be used. However, since"
" the GPU memory usage during the warm-up phase may differ from that "
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
"during actual inference. The default value is `0.9`."
msgstr ""
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
"大小。在预热阶段vLLM 中称为 profile runvLLM 会记录输入大小为 `--max-num-batched-tokens` "
"的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache "
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
"utilization` 设置得过高可能导致实际推理时出现 OOM内存不足问题。默认值为 `0.9`。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:123
msgid ""
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
"does not support a mixed approach of ETP and EP; that is, MoE can either "
"use pure EP or pure TP."
msgstr ""
"`--enable-expert-parallel` 表示启用了 EP。请注意vLLM 不支持 ETP 和 EP 的混合方法也就是说MoE "
"要么使用纯 EP要么使用纯 TP。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:124
msgid ""
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
"To enable it, remove this option."
msgstr "`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:125
msgid ""
"`--quantization` \"ascend\" indicates that quantization is used. To "
"disable quantization, remove this option."
msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:126
msgid ""
"`--compilation-config` contains configurations related to the aclgraph "
"graph mode. The most significant configurations are \"cudagraph_mode\" "
"and \"cudagraph_capture_sizes\", which have the following meanings: "
"\"cudagraph_mode\": represents the specific graph mode. Currently, "
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
"mainly used to reduce the cost of operator dispatch. Currently, "
"\"FULL_DECODE_ONLY\" is recommended."
msgstr ""
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示具体的图模式。目前支持 "
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
"\"FULL_DECODE_ONLY\"。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:128
msgid ""
"\"cudagraph_capture_sizes\": represents different levels of graph modes. "
"The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
"In the graph mode, the input for graphs at different levels is fixed, and"
" inputs between levels are automatically padded to the next level. "
"Currently, the default setting is recommended. Only in some scenarios is "
"it necessary to set this separately to achieve optimal performance."
msgstr ""
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
"40,..., `--max-num-"
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:129
msgid ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
"optimization is enabled. Currently, this optimization is only supported "
"for MoE in scenarios where tp_size > 1."
msgstr ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 "
"tp_size > 1 的场景下对 MoE 支持。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:133
msgid "tp_size needs to be divisible by dcp_size"
msgstr "tp_size 需要能被 dcp_size 整除"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:134
msgid ""
"decode context parallel size must be less than or equal to max_dcp_size, "
"where max_dcp_size = tensor_parallel_size // total_num_kv_heads."
msgstr ""
"解码上下文并行大小必须小于或等于 max_dcp_size其中 max_dcp_size = tensor_parallel_size // "
"total_num_kv_heads。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:136
msgid "Accuracy Evaluation"
msgstr "精度评估"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:138
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:150
msgid "Using AISBench"
msgstr "使用 AISBench"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:140
msgid ""
"Refer to [Using "
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
"details."
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:142
msgid ""
"After execution, you can get the result, here is the result of `Qwen3"
"-235B-A22B-w8a8` for reference only."
msgstr "执行后,您可以获得结果,以下是 `Qwen3-235B-A22B-w8a8` 的结果,仅供参考。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "83.33"
msgstr "83.33"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:148
msgid "Performance"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:152
msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr ""
"详情请参阅[使用 AISBench "
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:154
msgid "Using vLLM Benchmark"
msgstr "使用 vLLM Benchmark"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:156
msgid "Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example."
msgstr "以运行 `Qwen3-235B-A22B-w8a8` 的性能评估为例。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:158
msgid ""
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
"for more details."
msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/)。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:160
msgid "There are three `vllm bench` subcommands:"
msgstr "`vllm bench` 有三个子命令:"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:162
msgid "`latency`: Benchmark the latency of a single batch of requests."
msgstr "`latency`:对单批请求的延迟进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:163
msgid "`serve`: Benchmark the online serving throughput."
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:164
msgid "`throughput`: Benchmark offline inference throughput."
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:166
msgid "Take the `serve` as an example. Run the code as follows."
msgstr "以 `serve` 为例。运行代码如下。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:173
msgid ""
"After about several minutes, you can get the performance evaluation "
"result."
msgstr "大约几分钟后,您将获得性能评估结果。"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "ttft"
msgstr "首词元时间"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "random"
msgstr "随机"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "performance"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "perf"
msgstr "性能"
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
msgid "17.36s"
msgstr "17.36秒"

View File

@@ -1,517 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:1
msgid "PD-Colocated with Mooncake Multi-Instance"
msgstr "PD 共置与 Mooncake 多实例"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:3
msgid "Getting Started"
msgstr "快速开始"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:5
msgid ""
"vLLM-Ascend now supports PD-colocated deployment with Mooncake features. "
"This guide provides step-by-step instructions to test these features with"
" constrained resources."
msgstr "vLLM-Ascend 现已支持结合 Mooncake 功能的 PD 共置部署。本指南提供了在有限资源下测试这些功能的逐步说明。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:9
msgid ""
"Using the Qwen2.5-72B-Instruct model as an example, this guide "
"demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two "
"Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies "
"4 NPU cards and uses PD-colocated deployment."
msgstr ""
"本指南以 Qwen2.5-72B-Instruct 模型为例,演示如何在两个 Atlas 800T A2 节点上使用 vllm-ascend "
"v0.11.0(包含 vLLM v0.11.0)部署两个 vLLM 实例。每个实例占用 4 个 NPU 卡,并采用 PD 共置部署。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:14
msgid "Verify Multi-Node Communication Environment"
msgstr "验证多节点通信环境"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:16
msgid "Physical Layer Requirements"
msgstr "物理层要求"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:18
msgid ""
"The two Atlas 800T A2 nodes must be physically interconnected via a RoCE "
"network. Without RoCE interconnection, cross-node KV Cache access "
"performance will be significantly degraded."
msgstr "两个 Atlas 800T A2 节点必须通过 RoCE 网络进行物理互连。若无 RoCE 互连,跨节点 KV Cache 访问性能将显著下降。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:21
msgid ""
"All NPU cards must communicate properly. Intra-node communication uses "
"HCCS, while inter-node communication uses the RoCE network."
msgstr "所有 NPU 卡必须能够正常通信。节点内通信使用 HCCS节点间通信使用 RoCE 网络。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:24
msgid "Verification Process"
msgstr "验证流程"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:26
msgid ""
"The following process serves as a reference example. Please modify "
"parameters such as IP addresses according to your actual environment."
msgstr "以下流程作为参考示例。请根据您的实际环境修改 IP 地址等参数。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:29
msgid "Single Node Verification:"
msgstr "单节点验证:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:31
msgid ""
"Execute the following commands sequentially. The results must all be "
"`success` and the status must be `UP`:"
msgstr "依次执行以下命令。结果必须全部为 `success` 且状态必须为 `UP`"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:47
msgid "Check NPU HCCN Configuration:"
msgstr "检查 NPU HCCN 配置:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:49
msgid ""
"Ensure that the hccn.conf file exists in the environment. If using "
"Docker, mount it into the container."
msgstr "确保环境中存在 hccn.conf 文件。如果使用 Docker请将其挂载到容器中。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:56
msgid "Get NPU IP Addresses:"
msgstr "获取 NPU IP 地址:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:62
msgid "Cross-Node PING Test:"
msgstr "跨节点 PING 测试:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:70
msgid "Check NPU TLS Configuration"
msgstr "检查 NPU TLS 配置"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:77
msgid "Run with Docker"
msgstr "使用 Docker 运行"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:79
msgid "Start a Docker container on each node."
msgstr "在每个节点上启动一个 Docker 容器。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:112
msgid "(Optional) Install Mooncake"
msgstr "(可选)安装 Mooncake"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:114
msgid ""
"Mooncake is pre-installed and functional in the v0.11.0 image. The "
"following installation steps are optional."
msgstr "Mooncake 在 v0.11.0 镜像中已预安装且功能正常。以下安装步骤是可选的。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:117
msgid ""
"Mooncake is the serving platform for Kimi, a leading LLM service provided"
" by Moonshot AI. Installation and compilation guide: <https://github.com"
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>."
msgstr ""
"Mooncake 是 Kimi 的服务平台Kimi 是由 Moonshot AI 提供的领先 LLM "
"服务。安装和编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file"
"#build-and-use-binaries>。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:121
msgid "First, obtain the Mooncake project using the following command:"
msgstr "首先,使用以下命令获取 Mooncake 项目:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:129
msgid "Install MPI:"
msgstr "安装 MPI"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:135
msgid "Install the relevant dependencies (Go installation is not required):"
msgstr "安装相关依赖(无需安装 Go"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:141
msgid "Compile and install:"
msgstr "编译并安装:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:151
msgid "After installation, verify that Mooncake is installed correctly:"
msgstr "安装后,验证 Mooncake 是否正确安装:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:160
msgid "Start Mooncake Master Service"
msgstr "启动 Mooncake Master 服务"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:162
msgid ""
"To start the Mooncake master service in one of the node containers, use "
"the following command:"
msgstr "要在其中一个节点容器中启动 Mooncake master 服务,请使用以下命令:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Parameter"
msgstr "参数"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Value"
msgstr "值"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Explanation"
msgstr "说明"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "port"
msgstr "端口"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "50088"
msgstr "50088"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Port for the master service"
msgstr "Master 服务端口"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "eviction_high_watermark_ratio"
msgstr "驱逐高水位线比例"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "0.95"
msgstr "0.95"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "High watermark ratio (95% threshold)"
msgstr "高水位线比例95% 阈值)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "eviction_ratio"
msgstr "驱逐比例"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "0.05"
msgstr "0.05"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Percentage to evict when full (5%)"
msgstr "缓存满时驱逐的百分比5%"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:179
msgid "Create a Mooncake Configuration File Named mooncake.json"
msgstr "创建名为 mooncake.json 的 Mooncake 配置文件"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:181
msgid "The template for the mooncake.json file is as follows:"
msgstr "mooncake.json 文件的模板如下:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "metadata_server"
msgstr "元数据服务器"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "P2PHANDSHAKE"
msgstr "P2PHANDSHAKE"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Point-to-point handshake mode"
msgstr "点对点握手模式"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "protocol"
msgstr "协议"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "ascend"
msgstr "ascend"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Ascend proprietary protocol"
msgstr "Ascend 专有协议"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "master_server_address"
msgstr "主服务器地址"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "90.90.100.188:50088(for example)"
msgstr "90.90.100.188:50088示例"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Master server address"
msgstr "主服务器地址"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "global_segment_size"
msgstr "全局段大小"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "107374182400"
msgstr "107374182400"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Size per segment (100 GB)"
msgstr "每个段的大小100 GB"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:200
msgid "vLLM Instance Deployment"
msgstr "vLLM 实例部署"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:202
msgid ""
"Create containers on both Node 1 and Node 2, and launch the Qwen2.5-72B-"
"Instruct model service in each to test the reusability and performance of"
" cross-node, cross-instance KV Cache. Instance 1 utilizes NPU cards [0-3]"
" on the first Atlas 800T A2 server, while Instance 2 utilizes cards [0-3]"
" on the second server."
msgstr ""
"在节点 1 和节点 2 上分别创建容器,并在每个容器中启动 Qwen2.5-72B-Instruct 模型服务,以测试跨节点、跨实例 KV "
"Cache 的可重用性和性能。实例 1 使用第一个 Atlas 800T A2 服务器上的 NPU 卡 [0-3],而实例 2 "
"使用第二个服务器上的卡 [0-3]。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:208
msgid "Deploy Instance 1"
msgstr "部署实例 1"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:210
msgid ""
"Replace file paths, host, and port parameters based on your actual "
"environment configuration."
msgstr "请根据您的实际环境配置替换文件路径、主机和端口参数。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:242
msgid "Deploy Instance 2"
msgstr "部署实例 2"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:244
msgid ""
"The deployment method for Instance 2 is identical to Instance 1. Simply "
"modify the `--host` and `--port` parameters according to your Instance 2 "
"configuration."
msgstr "实例 2 的部署方法与实例 1 相同。只需根据您的实例 2 配置修改 `--host` 和 `--port` 参数。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:248
msgid "Configuration Parameters"
msgstr "配置参数"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "kv_connector"
msgstr "kv_connector"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "MooncakeConnectorStoreV1"
msgstr "MooncakeConnectorStoreV1"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Use StoreV1 version"
msgstr "使用 StoreV1 版本"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "kv_role"
msgstr "kv_role"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "kv_both"
msgstr "kv_both"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Enable both produce and consume"
msgstr "同时启用生产和消费"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "use_layerwise"
msgstr "use_layerwise"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "false"
msgstr "false"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Transfer entire cache (see note)"
msgstr "传输整个缓存(参见备注)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "mooncake_rpc_port"
msgstr "mooncake_rpc_port"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "0"
msgstr "0"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Automatic port assignment"
msgstr "自动端口分配"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "load_async"
msgstr "load_async"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "true"
msgstr "true"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Enable asynchronous loading"
msgstr "启用异步加载"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "register_buffer"
msgstr "register_buffer"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Required for PD-colocated mode"
msgstr "PD 共置模式必需"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:259
msgid "**Note on use_layerwise:**"
msgstr "**关于 use_layerwise 的说明:**"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:261
msgid ""
"`false`: Transfer entire KV Cache (suitable for cross-node with "
"sufficient bandwidth)"
msgstr "`false`: 传输整个KV缓存适用于跨节点且带宽充足的情况"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:263
msgid ""
"`true`: Layer-by-layer transfer (suitable for single-node memory "
"constraints)"
msgstr "`true`: 逐层传输(适用于单节点内存受限的情况)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:266
msgid "Benchmark"
msgstr "性能基准测试"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:268
msgid ""
"We recommend using the **AISBench** tool to assess performance. The test "
"uses **Dataset A**, consisting of fully random data, with the following "
"configuration:"
msgstr "我们推荐使用 **AISBench** 工具进行性能评估。测试使用 **数据集A**,该数据集由完全随机的数据组成,配置如下:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:272
msgid "Input/output tokens: 1024/10"
msgstr "输入/输出令牌数1024/10"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:273
msgid "Total requests: 100"
msgstr "总请求数100"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:274
msgid "Concurrency: 25"
msgstr "并发数25"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:276
msgid "The test procedure consists of three steps:"
msgstr "测试流程包含三个步骤:"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:278
msgid "Step 1: Baseline (No Cache)"
msgstr "步骤 1基准测试无缓存"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:280
msgid ""
"Send Dataset A to Instance 1 on Node 1 and record the Time to First Token"
" (TTFT) as **TTFT1**."
msgstr "将数据集A发送到节点1上的实例1并记录首令牌时间TTFT为 **TTFT1**。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:283
msgid "Preparation for Step 2"
msgstr "步骤 2 的准备工作"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:285
msgid ""
"Before Step 2, send a fully random Dataset B to Instance 1. Due to the "
"unified on-chip memory/DRAM KV Cache with LRU (Least Recently Used) "
"eviction policy, Dataset B's cache evicts Dataset A's cache from on-chip "
"memory, leaving Dataset A's cache only in Node 1's DRAM."
msgstr "在步骤2之前向实例1发送一个完全随机的数据集B。由于采用了具有LRU最近最少使用淘汰策略的统一HBM/DRAM KV缓存数据集B的缓存会将数据集A的缓存从HBM中淘汰使得数据集A的缓存仅保留在节点1的DRAM中。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:290
msgid "Step 2: Local DRAM Hit"
msgstr "步骤 2本地DRAM命中"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:292
msgid ""
"Send Dataset A to Instance 1 again to measure the performance when "
"hitting the KV Cache in local DRAM. Record the TTFT as **TTFT2**."
msgstr "再次将数据集A发送到实例1以测量命中本地DRAM中KV缓存时的性能。记录TTFT为 **TTFT2**。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:295
msgid "Step 3: Cross-Node DRAM Hit"
msgstr "步骤 3跨节点DRAM命中"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:297
msgid ""
"Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this "
"results in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT "
"as **TTFT3**."
msgstr "将数据集A发送到实例2。借助Mooncake KV缓存池这将导致一次来自节点1 DRAM的跨节点KV缓存命中。记录TTFT为 **TTFT3**。"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:301
msgid "**Model Configuration**:"
msgstr "**模型配置**"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:329
msgid "**Performance Benchmarking Commands**:"
msgstr "**性能基准测试命令**"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:337
msgid "Test Results"
msgstr "测试结果"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Requests"
msgstr "请求数"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "Concur"
msgstr "并发数"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "TTFT1 (ms)"
msgstr "TTFT1 (毫秒)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "TTFT2 (ms)"
msgstr "TTFT2 (毫秒)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "TTFT3 (ms)"
msgstr "TTFT3 (毫秒)"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "100"
msgstr "100"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "25"
msgstr "25"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "2322"
msgstr "2322"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "739"
msgstr "739"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md
msgid "948"
msgstr "948"

View File

@@ -1,501 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:1
msgid "Prefill-Decode Disaggregation (Deepseek)"
msgstr "预填充-解码解耦部署 (Deepseek)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:3
msgid "Getting Started"
msgstr "快速开始"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:5
msgid ""
"vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP "
"(Expert Parallel) options. This guide takes one-by-one steps to verify "
"these features with constrained resources."
msgstr "vLLM-Ascend 现已支持结合专家并行EP选项的预填充-解码PD解耦部署。本指南将逐步引导您在有限资源下验证这些功能。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:7
msgid ""
"Take the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 "
"servers to deploy the \"2P1D\" architecture. Assume the IP of the "
"prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and "
"the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). "
"On each server, use 8 NPUs and 16 chips to deploy one service instance."
msgstr ""
"以 Deepseek-r1-w8a8 模型为例,使用 4 台 Atlas 800T A3 服务器部署 \"2P1D\" 架构。假设预填充服务器 "
"IP 为 192.0.0.1(预填充节点 1和 192.0.0.2(预填充节点 2解码服务器 IP 为 192.0.0.3(解码节点 1和"
" 192.0.0.4(解码节点 2。每台服务器使用 8 个 NPU16 个芯片)部署一个服务实例。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:9
msgid "Verify Multi-Node Communication Environment"
msgstr "验证多节点通信环境"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:11
msgid "Physical Layer Requirements"
msgstr "物理层要求"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:13
msgid ""
"The physical machines must be located on the same WLAN, with network "
"connectivity."
msgstr "物理服务器必须位于同一局域网内,并具备网络连通性。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:14
msgid ""
"All NPUs must be interconnected. Intra-node connectivity is via HCCS, and"
" inter-node connectivity is via RDMA."
msgstr "所有 NPU 必须能够互联。节点内通过 HCCS 连接,节点间通过 RDMA 连接。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:16
msgid "Verification Process"
msgstr "验证流程"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:18
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:27
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:83
msgid ""
"Execute the following commands on each node in sequence. The results must"
" all be `success` and the status must be `UP`:"
msgstr "依次在每个节点上执行以下命令。所有结果必须为 `success` 且状态必须为 `UP`"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "A3"
msgstr "A3"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:25
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:81
msgid "Single Node Verification:"
msgstr "单节点验证:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:42
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:98
msgid "Check NPU HCCN Configuration:"
msgstr "检查 NPU HCCN 配置:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:44
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:100
msgid ""
"Ensure that the hccn.conf file exists in the environment. If using "
"Docker, mount it into the container."
msgstr "确保环境中存在 hccn.conf 文件。如果使用 Docker请将其挂载到容器中。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:50
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:106
msgid "Get NPU IP Addresses"
msgstr "获取 NPU IP 地址"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:57
msgid "Get superpodid and SDID"
msgstr "获取 superpodid 和 SDID"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:63
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:112
msgid "Cross-Node PING Test"
msgstr "跨节点 PING 测试"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:70
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:119
msgid "Check NPU TLS Configuration"
msgstr "检查 NPU TLS 配置"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "A2"
msgstr "A2"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:128
msgid "Run with Docker"
msgstr "使用 Docker 运行"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:130
msgid "Start a Docker container on each node."
msgstr "在每个节点上启动一个 Docker 容器。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:174
msgid "Install Mooncake"
msgstr "安装 Mooncake"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:176
msgid ""
"Mooncake is the serving platform for Kimi, a leading LLM service provided"
" by Moonshot AI.Installation and Compilation Guide: <https://github.com"
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries> First, we"
" need to obtain the Mooncake project. Refer to the following command:"
msgstr ""
"Mooncake 是月之暗面Moonshot AI提供的领先 LLM 服务 Kimi "
"的推理平台。安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file"
"#build-and-use-binaries> 首先,我们需要获取 Mooncake 项目。参考以下命令:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:183
msgid "(Optional) Replace go install url if the network is poor"
msgstr "(可选)如果网络状况不佳,请替换 go install 的 URL"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:190
msgid "Install mpi"
msgstr "安装 mpi"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:196
msgid "Install the relevant dependencies. The installation of Go is not required."
msgstr "安装相关依赖。无需安装 Go。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:202
msgid "Compile and install"
msgstr "编译并安装"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:212
msgid "Set environment variables"
msgstr "设置环境变量"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:214
msgid "**Note:**"
msgstr "**注意:**"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:216
msgid "Adjust the Python path according to your specific Python installation"
msgstr "请根据您具体的 Python 安装路径进行调整"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:217
msgid ""
"Ensure `/usr/local/lib` and `/usr/local/lib64` are in your "
"`LD_LIBRARY_PATH`"
msgstr "确保 `/usr/local/lib` 和 `/usr/local/lib64` 在您的 `LD_LIBRARY_PATH` 中"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:223
msgid "Prefiller/Decoder Deployment"
msgstr "预填充器/解码器部署"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:225
msgid ""
"We can run the following scripts to launch a server on the "
"prefiller/decoder node, respectively. Please note that each P/D node will"
" occupy ports ranging from kv_port to kv_port + num_chips to initialize "
"socket listeners. To avoid any issues, port conflicts should be "
"prevented. Additionally, ensure that each node's engine_id is uniquely "
"assigned to avoid conflicts."
msgstr ""
"我们可以分别运行以下脚本来在预填充器/解码器节点上启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + "
"num_chips 的端口范围来初始化 socket 监听器。为避免问题,应防止端口冲突。此外,请确保每个节点的 engine_id "
"被唯一分配,以避免冲突。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:227
msgid "kv_port Configuration Guide"
msgstr "kv_port 配置指南"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:229
msgid ""
"On Ascend NPU, Mooncake uses AscendDirectTransport for RDMA data "
"transfer, which randomly allocates ports within range `[20000, 20000 + "
"npu_per_node × 1000)`. If `kv_port` overlaps with this range, "
"intermittent port conflicts may occur. To avoid this, configure `kv_port`"
" according to the table below:"
msgstr ""
"在 Ascend NPU 上Mooncake 使用 AscendDirectTransport 进行 RDMA 数据传输,它会在 "
"`[20000, 20000 + npu_per_node × 1000)` 范围内随机分配端口。如果 `kv_port` "
"与此范围重叠,可能会发生间歇性端口冲突。为避免此问题,请根据下表配置 `kv_port`"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "NPUs per Node"
msgstr "每节点 NPU 数量"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "Reserved Port Range"
msgstr "保留端口范围"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "Recommended kv_port"
msgstr "推荐 kv_port"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "8"
msgstr "8"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "20000 - 27999"
msgstr "20000 - 27999"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid ">= 28000"
msgstr ">= 28000"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "16"
msgstr "16"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid "20000 - 35999"
msgstr "20000 - 35999"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
msgid ">= 36000"
msgstr ">= 36000"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:237
msgid ""
"If you occasionally see `zmq.error.ZMQError: Address already in use` "
"during startup, it may be caused by kv_port conflicting with randomly "
"allocated AscendDirectTransport ports. Increase your kv_port value to "
"avoid the reserved range."
msgstr ""
"如果在启动时偶尔看到 `zmq.error.ZMQError: Address already in use`,可能是由于 kv_port "
"与随机分配的 AscendDirectTransport 端口冲突所致。请增加您的 kv_port 值以避开保留范围。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:240
msgid "launch_online_dp.py"
msgstr "launch_online_dp.py"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:242
msgid ""
"Use `launch_online_dp.py` to launch external dp vllm servers. "
"[launch_online_dp.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
msgstr ""
"使用 `launch_online_dp.py` 启动外部解耦 vllm "
"服务器。[launch_online_dp.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:245
msgid "run_dp_template.sh"
msgstr "run_dp_template.sh"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:247
msgid ""
"Modify `run_dp_template.sh` on each node. "
"[run_dp_template.sh](https://github.com/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
msgstr ""
"在每个节点上修改 `run_dp_template.sh`。[run_dp_template.sh](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:250
msgid "Layerwise"
msgstr "分层模式"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "Prefiller node 1"
msgstr "预填充节点 1"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "Prefiller node 2"
msgstr "预填充节点 2"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "Decoder node 1"
msgstr "解码节点 1"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
msgid "Decoder node 2"
msgstr "解码节点 2"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:493
msgid "Non-layerwise"
msgstr "非分层模式"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:735
msgid "Start the service"
msgstr "启动服务"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:748
msgid "Example Proxy for Deployment"
msgstr "部署示例代理"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:750
msgid ""
"Run a proxy server on the same node where your prefiller service instance"
" is deployed. You can find the proxy implementation in the repository's "
"examples directory."
msgstr "在部署了预填充器服务实例的同一节点上运行一个代理服务器。您可以在仓库的 examples 目录中找到代理实现。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:752
msgid ""
"We provide two different proxy implementations with distinct request "
"routing behaviors:"
msgstr "我们提供两种具有不同请求路由行为的代理实现:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:754
msgid ""
"**`load_balance_proxy_layerwise_server_example.py`**: Requests are first "
"routed to the D nodes, which then forward to the P nodes as needed.This "
"proxy is designed for use with the "
"MooncakeLayerwiseConnector.[load\\_balance\\_proxy\\_layerwise\\_server\\_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
msgstr ""
"**`load_balance_proxy_layerwise_server_example.py`**:请求首先被路由到 D "
"节点,然后根据需要转发到 P 节点。此代理设计用于与 MooncakeLayerwiseConnector "
"配合使用。[load_balance_proxy_layerwise_server_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:756
msgid ""
"**`load_balance_proxy_server_example.py`**: Requests are first routed to "
"the P nodes, which then forward to the D nodes for subsequent "
"processing.This proxy is designed for use with the "
"MooncakeConnector.[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"**`load_balance_proxy_server_example.py`**:请求首先被路由到 P 节点,然后转发到 D "
"节点进行后续处理。此代理设计用于与 MooncakeConnector "
"配合使用。[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Parameter"
msgstr "参数"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "meaning"
msgstr "含义"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--port"
msgstr "--port"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Proxy service Port"
msgstr "代理服务端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--host"
msgstr "--host"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Proxy service Host IP"
msgstr "代理服务主机 IP"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--prefiller-hosts"
msgstr "--prefiller-hosts"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Hosts of prefiller nodes"
msgstr "预填充节点主机列表"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--prefiller-ports"
msgstr "--prefiller-ports"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Ports of prefiller nodes"
msgstr "预填充节点端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--decoder-hosts"
msgstr "--decoder-hosts"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Hosts of decoder nodes"
msgstr "解码器节点主机地址"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "--decoder-ports"
msgstr "--decoder-ports"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
msgid "Ports of decoder nodes"
msgstr "解码器节点端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:877
msgid ""
"You can get the proxy program in the repository's examples, "
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"您可以在代码仓库的示例中找到代理程序,[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:879
msgid "Benchmark"
msgstr "基准测试"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:881
msgid ""
"We recommend use aisbench tool to assess performance. "
"[aisbench](https://gitee.com/aisbench/benchmark) Execute the following "
"commands to install aisbench"
msgstr ""
"我们推荐使用 aisbench 工具进行性能评估。[aisbench](https://gitee.com/aisbench/benchmark)"
" 执行以下命令安装 aisbench"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:889
msgid ""
"You need to cancel the http proxy before assessing performance, as "
"following"
msgstr "在评估性能前,您需要取消 HTTP 代理设置,如下所示"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:897
msgid "You can place your datasets in the dir: `benchmark/ais_bench/datasets`"
msgstr "您可以将数据集放置在目录:`benchmark/ais_bench/datasets` 中"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:898
msgid ""
"You can change the configuration in the dir "
":`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take the "
"``vllm_api_stream_chat.py`` for example"
msgstr ""
"您可以在目录 `benchmark/ais_bench/benchmark/configs/models/vllm_api` 中修改配置。以 "
"`vllm_api_stream_chat.py` 为例"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:924
msgid ""
"Take gsm8k dataset for example, execute the following commands to assess"
" performance."
msgstr "以 gsm8k 数据集为例,执行以下命令来评估性能。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:930
msgid ""
"For more details for commands and parameters for aisbench, refer to "
"[aisbench](https://gitee.com/aisbench/benchmark)"
msgstr ""
"有关 aisbench 命令和参数的更多详细信息,请参考 "
"[aisbench](https://gitee.com/aisbench/benchmark)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:932
msgid "FAQ"
msgstr "常见问题"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:934
msgid "1. Prefiller nodes need to warmup"
msgstr "1. 预填充节点需要预热"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:936
msgid ""
"Since the computation of some NPU operators requires several rounds of "
"warm-up to achieve best performance, we recommend preheating the service "
"with some requests before conducting performance tests to achieve the "
"best end-to-end throughput."
msgstr "由于部分 NPU 算子的计算需要经过多轮预热才能达到最佳性能,我们建议在进行性能测试前,先用一些请求预热服务,以获得最佳的端到端吞吐量。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:938
msgid "Verification"
msgstr "验证"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:940
msgid "Check service health using the proxy server endpoint."
msgstr "使用代理服务器端点检查服务健康状况。"

View File

@@ -1,214 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:1
msgid "Prefill-Decode Disaggregation (Qwen2.5-VL)"
msgstr "预填充-解码解耦架构 (Qwen2.5-VL)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:3
msgid "Getting Started"
msgstr "开始使用"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:5
msgid ""
"vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide "
"provides step-by-step instructions to verify this features in resource-"
"constrained environments."
msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南提供逐步说明,帮助您在资源受限的环境中验证这些功能。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:7
msgid ""
"Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend "
"v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "
"\"1P1D\" architecture (one Prefiller and one Decoder on the same node). "
"Assume the IP address is 192.0.0.1."
msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vllm-ascend v0.11.0rc1(包含 vLLM v0.11.0)部署 \"1P1D\" 架构(同一节点上一个预填充器和一个解码器)。假设 IP 地址为 192.0.0.1。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:9
msgid "Verify Communication Environment"
msgstr "验证通信环境"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:11
msgid "Verification Process"
msgstr "验证流程"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:13
msgid "Single Node Verification:"
msgstr "单节点验证:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:15
msgid ""
"Execute the following commands in sequence. The results must all be "
"`success` and the status must be `UP`:"
msgstr "依次执行以下命令。结果必须均为 `success` 且状态必须为 `UP`"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:30
msgid "Check NPU HCCN Configuration:"
msgstr "检查 NPU HCCN 配置:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:32
msgid ""
"Ensure that the hccn.conf file exists in the environment. If using "
"Docker, mount it into the container."
msgstr "确保环境中存在 hccn.conf 文件。如果使用 Docker请将其挂载到容器中。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:38
msgid "Get NPU IP Addresses"
msgstr "获取 NPU IP 地址"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:44
msgid "Cross-Node PING Test"
msgstr "跨节点 PING 测试"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:51
msgid "Check NPU TLS Configuration"
msgstr "检查 NPU TLS 配置"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:58
msgid "Run with Docker"
msgstr "使用 Docker 运行"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:60
msgid "Start a Docker container."
msgstr "启动一个 Docker 容器。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:95
msgid "Install Mooncake"
msgstr "安装 Mooncake"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:97
msgid ""
"Mooncake is the serving platform for Kimi, a leading LLM service provided"
" by Moonshot AI. Installation and Compilation Guide: <https://github.com"
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>. First, "
"we need to obtain the Mooncake project. Refer to the following command:"
msgstr "Mooncake 是 Kimi 的服务平台Kimi 是由 Moonshot AI 提供的领先 LLM 服务。安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>。首先,我们需要获取 Mooncake 项目。参考以下命令:"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:104
msgid "(Optional) Replace go install url if the network is poor."
msgstr "(可选)如果网络状况不佳,请替换 go install 的 URL。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:111
msgid "Install mpi."
msgstr "安装 mpi。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:117
msgid "Install the relevant dependencies. The installation of Go is not required."
msgstr "安装相关依赖。无需安装 Go。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:123
msgid "Compile and install."
msgstr "编译并安装。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:133
msgid "Set environment variables."
msgstr "设置环境变量。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:135
msgid "**Note:**"
msgstr "**注意:**"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:137
msgid "Adjust the Python path according to your specific Python installation"
msgstr "根据您具体的 Python 安装情况调整 Python 路径"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:138
msgid ""
"Ensure `/usr/local/lib` and `/usr/local/lib64` are in your "
"`LD_LIBRARY_PATH`"
msgstr "确保 `/usr/local/lib` 和 `/usr/local/lib64` 在您的 `LD_LIBRARY_PATH` 中"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:144
msgid "Prefiller/Decoder Deployment"
msgstr "预填充器/解码器部署"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:146
msgid ""
"We can run the following scripts to launch a server on the "
"prefiller/decoder NPU, respectively."
msgstr "我们可以分别运行以下脚本来在预填充器/解码器 NPU 上启动服务器。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md
msgid "Prefiller"
msgstr "预填充器"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md
msgid "Decoder"
msgstr "解码器"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:236
msgid ""
"If you want to run \"2P1D\", please set ASCEND_RT_VISIBLE_DEVICES and "
"port to different values for each P process."
msgstr "如果您想运行 \"2P1D\",请为每个 P 进程将 ASCEND_RT_VISIBLE_DEVICES 和 port 设置为不同的值。"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:238
msgid "Example Proxy for Deployment"
msgstr "部署示例代理"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:240
msgid ""
"Run a proxy server on the same node with the prefiller service instance. "
"You can get the proxy program in the repository's examples: "
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr "在与预填充器服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到该代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "Parameter"
msgstr "参数"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "Meaning"
msgstr "含义"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "--port"
msgstr "--port"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "Port of proxy"
msgstr "代理端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "--prefiller-port"
msgstr "--prefiller-port"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "All ports of prefill"
msgstr "所有预填充端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "--decoder-ports"
msgstr "--decoder-ports"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:193
msgid "All ports of decoder"
msgstr "所有解码器端口"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:258
msgid "Verification"
msgstr "验证"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:260
msgid "Check service health using the proxy server endpoint."
msgstr "使用代理服务器端点检查服务健康状态。"

View File

@@ -1,219 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/ray.md:1
msgid "Ray Distributed (Qwen3-235B-A22B)"
msgstr "Ray 分布式部署 (Qwen3-235B-A22B)"
#: ../../source/tutorials/features/ray.md:3
msgid ""
"Multi-node inference is suitable for scenarios where the model cannot be "
"deployed on a single machine. In such cases, the model can be distributed"
" using tensor parallelism or pipeline parallelism. The specific "
"parallelism strategies will be covered in the following sections. To "
"successfully deploy multi-node inference, the following three steps need "
"to be completed:"
msgstr ""
"多节点推理适用于模型无法在单机上部署的场景。在这种情况下,可以使用张量并行或流水线并行来分布模型。具体的并行策略将在后续章节中介绍。要成功部署多节点推理,需要完成以下三个步骤:"
#: ../../source/tutorials/features/ray.md:5
msgid "**Verify Multi-Node Communication Environment**"
msgstr "**验证多节点通信环境**"
#: ../../source/tutorials/features/ray.md:6
msgid "**Set Up and Start the Ray Cluster**"
msgstr "**设置并启动 Ray 集群**"
#: ../../source/tutorials/features/ray.md:7
msgid "**Start the Online Inference Service on Multi-node**"
msgstr "**在多节点上启动在线推理服务**"
#: ../../source/tutorials/features/ray.md:9
msgid "Verify Multi-Node Communication Environment"
msgstr "验证多节点通信环境"
#: ../../source/tutorials/features/ray.md:11
msgid "Physical Layer Requirements"
msgstr "物理层要求"
#: ../../source/tutorials/features/ray.md:13
msgid ""
"The physical machines must be located on the same LAN, with network "
"connectivity."
msgstr "物理机必须位于同一局域网内,并具备网络连通性。"
#: ../../source/tutorials/features/ray.md:14
msgid ""
"All NPUs are connected with optical modules, and the connection status "
"must be normal."
msgstr "所有 NPU 均通过光模块连接,且连接状态必须正常。"
#: ../../source/tutorials/features/ray.md:16
msgid "Verification Process"
msgstr "验证流程"
#: ../../source/tutorials/features/ray.md:18
msgid ""
"Execute the following commands on each node in sequence. The results must"
" all be `success` and the status must be `UP`:"
msgstr "依次在每个节点上执行以下命令。结果必须均为 `success`,状态必须为 `UP`"
#: ../../source/tutorials/features/ray.md:35
msgid "NPU Interconnect Verification"
msgstr "NPU 互联验证"
#: ../../source/tutorials/features/ray.md:37
msgid "1. Get NPU IP Addresses"
msgstr "1. 获取 NPU IP 地址"
#: ../../source/tutorials/features/ray.md:43
msgid "2. Cross-Node PING Test"
msgstr "2. 跨节点 PING 测试"
#: ../../source/tutorials/features/ray.md:50
msgid "Set Up and Start the Ray Cluster"
msgstr "设置并启动 Ray 集群"
#: ../../source/tutorials/features/ray.md:52
msgid "Setting Up the Basic Container"
msgstr "设置基础容器"
#: ../../source/tutorials/features/ray.md:54
msgid ""
"To ensure a consistent execution environment across all nodes, including "
"the model path and Python environment, it is advised to use Docker "
"images."
msgstr "为确保所有节点(包括模型路径和 Python 环境)的执行环境一致,建议使用 Docker 镜像。"
#: ../../source/tutorials/features/ray.md:56
msgid ""
"For setting up a multi-node inference cluster with Ray, **containerized "
"deployment** is the preferred approach. Containers should be started on "
"both the primary and secondary nodes, with the `--net=host` option to "
"enable proper network connectivity."
msgstr "对于使用 Ray 设置多节点推理集群,**容器化部署**是首选方法。应在主节点和从节点上都启动容器,并使用 `--net=host` 选项以确保正确的网络连接。"
#: ../../source/tutorials/features/ray.md:58
msgid ""
"Below is the example container setup command, which should be executed on"
" **all nodes** :"
msgstr "以下是容器设置命令示例,应在 **所有节点** 上执行:"
#: ../../source/tutorials/features/ray.md:94
msgid "Start Ray Cluster"
msgstr "启动 Ray 集群"
#: ../../source/tutorials/features/ray.md:96
msgid ""
"After setting up the containers and installing vllm-ascend on each node, "
"follow the steps below to start the Ray cluster and execute inference "
"tasks."
msgstr "在每个节点上设置好容器并安装 vllm-ascend 后,按照以下步骤启动 Ray 集群并执行推理任务。"
#: ../../source/tutorials/features/ray.md:98
msgid ""
"Choose one machine as the primary node and the others as secondary nodes."
" Before proceeding, use `ip addr` to check your `nic_name` (network "
"interface name)."
msgstr "选择一台机器作为主节点,其他作为从节点。在继续之前,使用 `ip addr` 检查您的 `nic_name`(网络接口名称)。"
#: ../../source/tutorials/features/ray.md:100
msgid ""
"Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the "
"NPU devices to use. For Ray versions above 2.1, also set the "
"`RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid "
"device recognition issues."
msgstr "设置 `ASCEND_RT_VISIBLE_DEVICES` 环境变量以指定要使用的 NPU 设备。对于 Ray 2.1 以上版本,还需设置 `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` 变量以避免设备识别问题。"
#: ../../source/tutorials/features/ray.md:102
msgid "Below are the commands for the primary and secondary nodes:"
msgstr "以下是主节点和从节点的命令:"
#: ../../source/tutorials/features/ray.md:104
msgid "**Primary node**:"
msgstr "**主节点**"
#: ../../source/tutorials/features/ray.md:107
#: ../../source/tutorials/features/ray.md:124
msgid ""
"When starting a Ray cluster for multi-node inference, the environment "
"variables on each node must be set **before** starting the Ray cluster "
"for them to take effect. Updating the environment variables requires "
"restarting the Ray cluster."
msgstr "在为多节点推理启动 Ray 集群时,必须在启动 Ray 集群 **之前** 设置每个节点上的环境变量,它们才会生效。更新环境变量需要重启 Ray 集群。"
#: ../../source/tutorials/features/ray.md:121
msgid "**Secondary node**:"
msgstr "**从节点**"
#: ../../source/tutorials/features/ray.md:137
msgid ""
"Once the cluster is started on multiple nodes, execute `ray status` and "
"`ray list nodes` to verify the Ray cluster's status. You should see the "
"correct number of nodes and NPUs listed."
msgstr "在多个节点上启动集群后,执行 `ray status` 和 `ray list nodes` 以验证 Ray 集群的状态。您应该看到列出的正确节点数和 NPU 数。"
#: ../../source/tutorials/features/ray.md:139
msgid ""
"After Ray is successfully started, the following content will appear: A "
"local Ray instance has started successfully. Dashboard URL: The access "
"address for the Ray Dashboard (default: <http://localhost:8265>); Node "
"status (CPU/memory resources, number of healthy nodes); Cluster "
"connection address (used for adding multiple nodes)."
msgstr "Ray 成功启动后,将出现以下内容:本地 Ray 实例已成功启动。仪表板 URLRay 仪表板的访问地址(默认:<http://localhost:8265>节点状态CPU/内存资源、健康节点数);集群连接地址(用于添加多个节点)。"
#: ../../source/tutorials/features/ray.md:143
msgid "Start the Online Inference Service on Multi-node scenario"
msgstr "在多节点场景下启动在线推理服务"
#: ../../source/tutorials/features/ray.md:145
msgid ""
"In the container, you can use vLLM as if all NPUs were on a single node. "
"vLLM will utilize NPU resources across all nodes in the Ray cluster."
msgstr "在容器中,您可以像所有 NPU 都在单个节点上一样使用 vLLM。vLLM 将利用 Ray 集群中所有节点的 NPU 资源。"
#: ../../source/tutorials/features/ray.md:147
msgid "**You only need to run the vllm command on one node.**"
msgstr "**您只需在一个节点上运行 vllm 命令。**"
#: ../../source/tutorials/features/ray.md:149
msgid ""
"To set up parallelism, the common practice is to set the `tensor-"
"parallel-size` to the number of NPUs per node, and the `pipeline-"
"parallel-size` to the number of nodes."
msgstr "要设置并行,通常的做法是将 `tensor-parallel-size` 设置为每个节点的 NPU 数量,将 `pipeline-parallel-size` 设置为节点数量。"
#: ../../source/tutorials/features/ray.md:151
msgid ""
"For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the "
"tensor parallel size to 8 and the pipeline parallel size to 2:"
msgstr "例如,对于分布在 2 个节点上的 16 个 NPU每个节点 8 个 NPU将张量并行大小设置为 8流水线并行大小设置为 2"
#: ../../source/tutorials/features/ray.md:167
msgid ""
"Alternatively, if you want to use only tensor parallelism, set the tensor"
" parallel size to the total number of NPUs in the cluster. For example, "
"with 16 NPUs across 2 nodes, set the tensor parallel size to 16:"
msgstr "或者,如果您只想使用张量并行,请将张量并行大小设置为集群中 NPU 的总数。例如,对于分布在 2 个节点上的 16 个 NPU将张量并行大小设置为 16"
#: ../../source/tutorials/features/ray.md:182
msgid "Once your server is started, you can query the model with input prompts:"
msgstr "服务器启动后,您可以使用输入提示词查询模型:"

View File

@@ -1,854 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.18.0\n"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:1
msgid "Suffix Speculative Decoding"
msgstr "后缀推测解码"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:3
msgid "**Introduction**"
msgstr "**简介**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:5
msgid ""
"Suffix Decoding is an optimization technique for speculative decoding "
"based on pattern matching. It simultaneously retrieves repetitive "
"sequences from both the prompt and the generated content, using frequency"
" statistics to predict the most likely token continuations. Unlike "
"traditional speculative decoding methods, Suffix Decoding runs entirely "
"on the CPU, eliminating the need for additional GPU resources or draft "
"models, which results in superior acceleration for repetitive tasks such "
"as AI agents and code generation."
msgstr ""
"后缀解码是一种基于模式匹配的推测解码优化技术。它同时从提示词和已生成内容中检索重复序列利用频率统计来预测最可能的后续标记。与传统的推测解码方法不同后缀解码完全在CPU上运行无需额外的GPU资源或草稿模型从而在AI智能体和代码生成等重复性任务上实现卓越的加速效果。"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:7
msgid ""
"This document provides step-by-step guidance on how to deploy and "
"benchmark the Suffix Decoding speculative inference technology supported "
"by `vllm-ascend` on Atlas A2 hardware. The setup utilizes a single Atlas "
"800T A2 node with a 4-card deployment of the Qwen3-32B model instance. "
"Benchmarking is conducted using authentic open-source datasets covering "
"the following categories:"
msgstr ""
"本文档提供了在Atlas A2硬件上部署和基准测试`vllm-ascend`支持的后缀解码推测推理技术的分步指南。该设置使用单个Atlas 800T A2节点部署了4卡的Qwen3-32B模型实例。基准测试使用涵盖以下类别的真实开源数据集进行"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Dataset Category**"
msgstr "**数据集类别**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Dataset Name**"
msgstr "**数据集名称**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Code Generation"
msgstr "代码生成"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "HumanEval"
msgstr "HumanEval"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Common Sense Reasoning"
msgstr "常识推理"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "ARC"
msgstr "ARC"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Mathematical Reasoning"
msgstr "数学推理"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "gsm8k"
msgstr "gsm8k"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Natural Language Understanding"
msgstr "自然语言理解"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "SuperGLUE_BoolQ"
msgstr "SuperGLUE_BoolQ"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Comprehensive Examination"
msgstr "综合评测"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "AGIEval"
msgstr "AGIEval"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Multi-turn Dialogue"
msgstr "多轮对话"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "ShareGPT"
msgstr "ShareGPT"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:18
#, python-format
msgid ""
"The benchmarking tool used in this tutorial is AISBench, which supports "
"performance testing for all the datasets listed above. The final section "
"of this tutorial presents a performance comparison between enabling and "
"disabling Suffix Decoding under the condition of satisfying an SLO TPOT <"
" 50ms across different datasets and concurrency levels. Validations "
"demonstrate that the Qwen3-32B model achieves a throughput improvement of"
" approximately 20% to 80% on various real-world datasets when Suffix "
"Decoding is enabled."
msgstr ""
"本教程使用的基准测试工具是AISBench它支持对上述所有数据集进行性能测试。本教程最后一节展示了在不同数据集和并发级别下满足SLO TPOT < 50ms条件时启用与禁用后缀解码的性能对比。验证表明启用后缀解码后Qwen3-32B模型在各种真实数据集上实现了约20%至80%的吞吐量提升。"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:20
msgid "**Download vllm-ascend Image**"
msgstr "**下载 vllm-ascend 镜像**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:22
msgid ""
"This tutorial uses the official image, version v0.13.0rc1. Use the "
"following command to download:"
msgstr "本教程使用官方镜像版本为v0.13.0rc1。使用以下命令下载:"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:28
msgid "**Run with Docker**"
msgstr "**使用 Docker 运行**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:30
msgid "Container startup command:"
msgstr "容器启动命令:"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:64
msgid "**Install arctic-inference**"
msgstr "**安装 arctic-inference**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:66
msgid ""
"Before enabling Suffix Decoding speculative inference on Ascend, the "
"Arctic Inference plugin must be installed. Arctic Inference is an open-"
"source plugin launched by Snowflake specifically to optimize LLM "
"inference speed. For detailed technical principles, please refer to the "
"following article: [Fastest Speculative Decoding in vLLM with Arctic "
"Inference and Arctic Training](https://www.snowflake.com/en/engineering-"
"blog/fast-speculative-decoding-vllm-arctic/). Install it within the "
"container using the following command:"
msgstr ""
"在Ascend上启用后缀解码推测推理之前必须安装Arctic Inference插件。Arctic Inference是Snowflake推出的一个开源插件专门用于优化LLM推理速度。详细技术原理请参考以下文章[Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/)。在容器内使用以下命令安装:"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:72
msgid "**vLLM Instance Deployment**"
msgstr "**vLLM 实例部署**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:74
msgid ""
"Use the following command to start the container service instance. "
"Speculative inference is enabled via the `--speculative-config` "
"parameter, where `method` is set to `suffix`. For this test, "
"`num_speculative_tokens` is uniformly set to `3`."
msgstr ""
"使用以下命令启动容器服务实例。通过`--speculative-config`参数启用推测推理,其中`method`设置为`suffix`。本次测试中,`num_speculative_tokens`统一设置为`3`。"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:99
msgid "**AISbench Benchmark Testing**"
msgstr "**AISbench 基准测试**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:101
msgid ""
"Performance for all open-source datasets is tested using AISbench. For "
"specific instructions, refer to [Using AISBench for performance "
"evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html"
"#execute-performance-evaluation)."
msgstr ""
"所有开源数据集的性能均使用AISbench进行测试。具体操作说明请参考[使用AISBench进行性能评估](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation)。"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:103
msgid "**Model Configuration**:"
msgstr "**模型配置**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:132
msgid "**Performance Benchmarking Commands**:"
msgstr "**性能基准测试命令**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:141
msgid "**Test Results**"
msgstr "**测试结果**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:143
msgid ""
"Below are the detailed test results of the six open-source datasets in "
"this evaluation. Compared to the baseline performance, the improvement in"
" TPOT and throughput performance at different concurrency levels after "
"enabling Suffix Decoding varies across datasets. The extent of "
"improvement after enabling Suffix Decoding differs among the datasets. "
"Below is a summary of the results:"
msgstr ""
"以下是本次评估中六个开源数据集的详细测试结果。与基线性能相比启用后缀解码后不同并发级别下的TPOT和吞吐量性能提升程度因数据集而异。启用后缀解码后的提升幅度在不同数据集间存在差异。以下是结果总结"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Typical Representative**"
msgstr "**典型代表**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Throughput Improvement (BS=1-10)**"
msgstr "**吞吐量提升 (BS=1-10)**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**SLO TPOT**"
msgstr "**SLO TPOT**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**High Gain**"
msgstr "**高增益**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "AGIEval, GSM8K"
msgstr "AGIEval, GSM8K"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**> 50%**"
msgstr "**> 50%**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "< 50ms"
msgstr "< 50ms"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Medium-Low Gain**"
msgstr "**中低增益**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "ARC, ShareGPT"
msgstr "ARC, ShareGPT"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**20% ~ 30%**"
msgstr "**20% ~ 30%**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md:150
msgid "Below is the raw detailed test results:"
msgstr "以下是原始详细测试结果:"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Concurrency"
msgstr "并发数"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Avg Input"
msgstr "平均输入长度"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Avg Output"
msgstr "平均输出长度"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Requests"
msgstr "请求数"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Base TPOT(ms)"
msgstr "基线 TPOT(ms)"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Base Throughput(TPS)"
msgstr "基线吞吐量(TPS)"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Suffix TPOT(ms)"
msgstr "后缀解码 TPOT(ms)"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Suffix Throughput(TPS)"
msgstr "后缀解码吞吐量(TPS)"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "Accept Rate"
msgstr "接受率"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "TPOT Gain"
msgstr "TPOT 增益"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "TPS Gain"
msgstr "TPS 增益"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**Humaneval**"
msgstr "**Humaneval**"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "1"
msgstr "1"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "150"
msgstr "150"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "2700"
msgstr "2700"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "100"
msgstr "100"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "55.1"
msgstr "55.1"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.1"
msgstr "18.1"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "37.9"
msgstr "37.9"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "26.3"
msgstr "26.3"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "27.0%"
msgstr "27.0%"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "45.2%"
msgstr "45.2%"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "45.1%"
msgstr "45.1%"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "15"
msgstr "15"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "61.6"
msgstr "61.6"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "233.8"
msgstr "233.8"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "45.8"
msgstr "45.8"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "318.2"
msgstr "318.2"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "34.6%"
msgstr "34.6%"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "36.1%"
msgstr "36.1%"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "26"
msgstr "26"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "64.7"
msgstr "64.7"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "403.8"
msgstr "403.8"
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "50.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "519.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "27.2%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "28.6%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**ARC**"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "76"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "960"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "52.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "39.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "25.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "23.9%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "33.7%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "59.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "125.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "47.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "163.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "25.7%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "30.0%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "59.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "245.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "48.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "311.7"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "22.3%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "26.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**GSM8K**"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "67"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "1570"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "55.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "35.7"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "28.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "31.1%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "55.6%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "58.4%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "17"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "61.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "279.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "45.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "403.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "35.6%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "44.0%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "63.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "396.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "50.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "527.6"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "27.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "33.1%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**ShareGPT**"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "666"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "231"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "327"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "54.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.3"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "39.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "24.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "37.9%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "31.5%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "58.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "125.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "46.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "153.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "27.1%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "22.5%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "14"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "61.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "227.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "49.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "273.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "23.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "20.7%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**SuperGLUE_BoolQ**"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "207"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "314"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "36.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "26.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "33.4%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "49.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "45.6%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "16"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "60.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "229.7"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "43.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "303.9"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "38.0%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "32.3%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "32"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "62.7"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "47.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "507.5"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "31.3%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "28.0%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "**AGIEval**"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "735"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "1880"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "53.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "18.7"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "31.8"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "34.1"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "50.3%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "66.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "81.9%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "24"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "64.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "381.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "43.3"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "629.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "47.8%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "65.0%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "34"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "70.0"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "494.6"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "50.2"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "768.4"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "39.4%"
msgstr ""
#: ../../source/tutorials/features/suffix_speculative_decoding.md
msgid "55.3%"
msgstr ""

Some files were not shown because too many files have changed in this diff