[v0.18.0][Doc] Translated Doc files 2026-04-14 (#8257)

## Auto-Translation Summary

Translated **102** file(s):

- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/contributors.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/governance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/user_stories/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/user_stories/llamafactory.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/testing.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/msprobe_guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/performance_benchmark.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/service_profiling_guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/faqs.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/installation.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/quick_start.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/graph_mode.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lora.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/quantization.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_features.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ACL_Graph.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ModelRunner_prepare_inputs.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/add_custom_aclnn_op.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/context_parallel.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/performance_and_debug/optimization_and_tuning.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/ray.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/suffix_speculative_decoding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/hardwares/310p.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/hardwares/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-R1.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.2.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2-Thinking.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2.5.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/MiniMax-M2.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-7B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-Omni.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-30B-A3B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-32B-W4A4.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-8B-W4A8.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Coder-30B-A3B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Dense.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Next.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-235B-A22B-Instruct.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-30B-A3B-Instruct.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-Embedding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-VL-Reranker.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-27B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_reranker.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/using_volcano_kthena.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Multi_Token_Prediction.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lmcache_ascend_deployment.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/rfork.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sequence_parallelism.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/speculative_decoding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po</code>

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24390263284)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
2026-04-15 15:27:09 +08:00
parent b6aa5bbdbf
commit 147b589f62
102 changed files with 41760 additions and 6023 deletions
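
For orientation when reviewing the `additional_config.po` changes, here is a minimal sketch of what an additional configuration built from the option names in that file might look like. The option names come from the `msgid` entries in the diff; the values are illustrative only. Placing `enabled` under `weight_prefetch_config` is an assumption inferred from the shared `enabled` entry. How the serialized string is handed to vLLM (command-line flag or keyword argument) is not shown in this diff, so the helper `to_cli_argument` is hypothetical.

```python
import json

# Option names are taken from the msgid entries in additional_config.po.
# Values are illustrative assumptions, not recommended settings.
additional_config = {
    "xlite_graph_config": {"enabled": True, "full_mode": False},
    "weight_prefetch_config": {
        # "enabled" here is an assumption; the diff lists it once for all sections.
        "enabled": True,
        "prefetch_ratio": {
            "attn": {"qkv": 1.0, "o": 1.0},
            "moe": {"gate_up": 0.8},
            "mlp": {"gate_up": 1.0, "down": 1.0},
        },
    },
    "enable_cpu_binding": True,
    "refresh": False,
}


def to_cli_argument(config: dict) -> str:
    """Serialize a config dict to compact JSON suitable for a single shell argument.

    Hypothetical helper: the diff does not show how the string is consumed.
    """
    return json.dumps(config, separators=(",", ":"), sort_keys=True)


cli_value = to_cli_argument(additional_config)
print(cli_value)
```

Serializing with compact separators keeps the value free of spaces, which makes it easier to quote as one argument in online mode; in offline mode the dictionary itself could presumably be passed directly.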
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po
@@ -4,283 +4,518 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/configuration/additional_config.md:1
+#: ../../source/user_guide/configuration/additional_config.md:1
 msgid "Additional Configuration"
 msgstr "附加配置"

-#: ../../user_guide/configuration/additional_config.md:3
+#: ../../source/user_guide/configuration/additional_config.md:3
 msgid ""
-"additional configuration is a mechanism provided by vLLM to allow plugins to"
-" control inner behavior by their own. vLLM Ascend uses this mechanism to "
-"make the project more flexible."
-msgstr "额外配置是 vLLM 提供的一种机制，允许插件自行控制内部行为。vLLM Ascend 利用这种机制使项目更加灵活。"
+"Additional configuration is a mechanism provided by vLLM to allow plugins"
+" to control internal behavior by themselves. VLLM Ascend uses this "
+"mechanism to make the project more flexible."
+msgstr "附加配置是 vLLM 提供的一种机制，允许插件自行控制内部行为。VLLM Ascend 利用此机制使项目更加灵活。"

-#: ../../user_guide/configuration/additional_config.md:5
+#: ../../source/user_guide/configuration/additional_config.md:5
 msgid "How to use"
-msgstr "如何使用"
+msgstr "使用方法"

-#: ../../user_guide/configuration/additional_config.md:7
+#: ../../source/user_guide/configuration/additional_config.md:7
 msgid ""
 "With either online mode or offline mode, users can use additional "
 "configuration. Take Qwen3 as an example:"
-msgstr "无论是在线模式还是离线模式，用户都可以使用额外的配置。以 Qwen3 为例："
+msgstr "无论是在线模式还是离线模式，用户都可以使用附加配置。以 Qwen3 为例："

-#: ../../user_guide/configuration/additional_config.md:9
+#: ../../source/user_guide/configuration/additional_config.md:9
 msgid "**Online mode**:"
 msgstr "**在线模式**："

-#: ../../user_guide/configuration/additional_config.md:15
+#: ../../source/user_guide/configuration/additional_config.md:15
 msgid "**Offline mode**:"
 msgstr "**离线模式**："

-#: ../../user_guide/configuration/additional_config.md:23
+#: ../../source/user_guide/configuration/additional_config.md:23
 msgid "Configuration options"
 msgstr "配置选项"

-#: ../../user_guide/configuration/additional_config.md:25
+#: ../../source/user_guide/configuration/additional_config.md:25
 msgid ""
-"The following table lists the additional configuration options available in "
+"The following table lists additional configuration options available in "
 "vLLM Ascend:"
-msgstr "下表列出了 vLLM Ascend 中可用的其他配置选项："
+msgstr "下表列出了 vLLM Ascend 中可用的附加配置选项："

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "Name"
 msgstr "名称"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "Type"
 msgstr "类型"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "Default"
-msgstr "默认"
+msgstr "默认值"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "Description"
 msgstr "描述"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`torchair_graph_config`"
-msgstr "`torchair_graph_config`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`xlite_graph_config`"
+msgstr "`xlite_graph_config`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "dict"
 msgstr "dict"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 #, python-brace-format
 msgid "`{}`"
 msgstr "`{}`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "The config options for torchair graph mode"
-msgstr "torchair 图模式的配置选项"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration options for Xlite graph mode"
+msgstr "Xlite 图模式的配置选项"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`ascend_scheduler_config`"
-msgstr "`ascend_scheduler_config`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`weight_prefetch_config`"
+msgstr "`weight_prefetch_config`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "The config options for ascend scheduler"
-msgstr "ascend 调度器的配置选项"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration options for weight prefetch"
+msgstr "权重预取的配置选项"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`expert_tensor_parallel_size`"
-msgstr "`expert_tensor_parallel_size`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`finegrained_tp_config`"
+msgstr "`finegrained_tp_config`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "str"
-msgstr "str"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration options for module tensor parallelism"
+msgstr "模块张量并行的配置选项"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`0`"
-msgstr "`0`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`ascend_compilation_config`"
+msgstr "`ascend_compilation_config`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "Expert tensor parallel size the model to use."
-msgstr "专家张量并行的模型大小设置。"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration options for ascend compilation"
+msgstr "昇腾编译的配置选项"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`eplb_config`"
+msgstr "`eplb_config`"
+
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`refresh`"
-msgstr "`刷新`"
+msgstr "`refresh`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "bool"
 msgstr "bool"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`false`"
 msgstr "`false`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"Whether to refresh global ascend config content. This value is usually used "
-"by rlhf or ut/e2e test case."
-msgstr "是否刷新全局 ascend 配置信息。此值通常由 rlhf 或 ut/e2e 测试用例使用。"
+"Whether to refresh global Ascend configuration content. This is usually "
+"used by rlhf or ut/e2e test case."
+msgstr "是否刷新全局 Ascend 配置内容。通常由 RLHF 或 UT/E2E 测试用例使用。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`expert_map_path`"
-msgstr "`expert_map_path`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`dump_config_path`"
+msgstr "`dump_config_path`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "str"
+msgstr "str"
+
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`None`"
 msgstr "`None`"

-#: ../../user_guide/configuration/additional_config.md
-msgid ""
-"When using expert load balancing for the MOE model, an expert map path needs"
-" to be passed in."
-msgstr "在为MOE模型使用专家负载均衡时，需要传入专家映射路径。"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration file path for msprobe dump(eager mode)."
+msgstr "msprobe dump（eager 模式）的配置文件路径。"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_async_exponential`"
+msgstr "`enable_async_exponential`"
+
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`False`"
 msgstr "`False`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "Whether to enable the fused operator-like chunked_prefill."
-msgstr "是否启用类似算子融合的 chunked_prefill 功能。"
-
-#: ../../user_guide/configuration/additional_config.md
-msgid "`kv_cache_dtype`"
-msgstr "`kv_cache_dtype`"
-
-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"When using the kv cache quantization method, kv cache dtype needs to be set,"
-" currently only int8 is supported."
-msgstr "当使用kv缓存量化方法时，需要设置kv缓存的数据类型，目前仅支持int8。"
+"Whether to enable asynchronous exponential overlap. To enable "
+"asynchronous exponential, set this config to True."
+msgstr "是否启用异步指数重叠。要启用异步指数，请将此配置设置为 True。"

-#: ../../user_guide/configuration/additional_config.md:37
-msgid "The details of each config option are as follows:"
-msgstr "每个配置选项的详细信息如下："
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_shared_expert_dp`"
+msgstr "`enable_shared_expert_dp`"

-#: ../../user_guide/configuration/additional_config.md:39
-msgid "**torchair_graph_config**"
-msgstr "**torchair_graph_config**"
-
-#: ../../user_guide/configuration/additional_config.md
-msgid "`enabled`"
-msgstr "`启用`"
-
-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"Whether to enable torchair graph mode. Currently only DeepSeek series models"
-" and PanguProMoE are supported to use torchair graph mode"
-msgstr "是否启用 torchair 图模式。目前仅支持 DeepSeek 系列模型和 PanguProMoE 使用 torchair 图模式。"
+"When the expert is shared in DP, it delivers better performance but "
+"consumes more memory. Currently only DeepSeek series models are "
+"supported."
+msgstr "当专家在 DP 中共享时，可获得更好的性能但会消耗更多内存。目前仅支持 DeepSeek 系列模型。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`enable_multistream_mla`"
-msgstr "`enable_multistream_mla`"
-
-#: ../../user_guide/configuration/additional_config.md
-msgid ""
-"Whether to put vector ops of MLA to another stream. This option only takes "
-"effects on models using MLA (e.g., DeepSeek)."
-msgstr "是否将MLA的向量操作放到另一个流中。此选项仅对使用MLA的模型（例如，DeepSeek）有效。"
-
-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`multistream_overlap_shared_expert`"
 msgstr "`multistream_overlap_shared_expert`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"Whether to enable multistream shared expert. This option only takes effects "
-"on DeepSeek moe models."
-msgstr "是否启用多流共享专家功能。此选项仅对 DeepSeek MoE 模型生效。"
+"Whether to enable multi-stream shared expert. This option only takes "
+"effect on MoE models with shared experts."
+msgstr "是否启用多流共享专家。此选项仅对具有共享专家的 MoE 模型生效。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`enable_view_optimize`"
-msgstr "`enable_view_optimize` （启用视图优化）"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`multistream_overlap_gate`"
+msgstr "`multistream_overlap_gate`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable multi-stream overlap gate. This option only takes "
+"effect on MoE models with shared experts."
+msgstr "是否启用多流重叠门。此选项仅对具有共享专家的 MoE 模型生效。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`recompute_scheduler_enable`"
+msgstr "`recompute_scheduler_enable`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable recompute scheduler."
+msgstr "是否启用重计算调度器。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_cpu_binding`"
+msgstr "`enable_cpu_binding`"
+
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`True`"
 msgstr "`True`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "Whether to enable torchair view optimization"
-msgstr "是否启用torchair视图优化"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable CPU binding. Only takes effect on ARM CPUs; A3 uses the"
+" global-slicing CPU allocation strategy and other device types use the "
+"topo-affinity CPU allocation strategy."
+msgstr "是否启用 CPU 绑定。仅在 ARM CPU 上生效；A3 使用全局切片 CPU 分配策略，其他设备类型使用拓扑亲和性 CPU 分配策略。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`use_cached_graph`"
-msgstr "`use_cached_graph`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`SLO_limits_for_dynamic_batch`"
+msgstr "`SLO_limits_for_dynamic_batch`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "Whether to use cached graph"
-msgstr "是否使用缓存的图"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "int"
+msgstr "int"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`graph_batch_sizes`"
-msgstr "`graph_batch_sizes`"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`-1`"
+msgstr "`-1`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "list[int]"
-msgstr "list[int]"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"SLO limits for dynamic batch. This is new scheduler to support dynamic "
+"batch feature"
+msgstr "动态批处理的 SLO 限制。这是支持动态批处理功能的新调度器。"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_npugraph_ex`"
+msgstr "`enable_npugraph_ex`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable npugraph_ex graph mode."
+msgstr "是否启用 npugraph_ex 图模式。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`pa_shape_list`"
+msgstr "`pa_shape_list`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "list"
+msgstr "list"
+
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`[]`"
 msgstr "`[]`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "The batch size for torchair graph cache"
-msgstr "torchair 图缓存的批量大小"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The custom shape list of page attention ops."
+msgstr "页面注意力算子的自定义形状列表。"

-#: ../../user_guide/configuration/additional_config.md
-msgid "`graph_batch_sizes_init`"
-msgstr "`graph_batch_sizes_init`"
-
-#: ../../user_guide/configuration/additional_config.md
-msgid "Init graph batch size dynamically if `graph_batch_sizes` is empty"
-msgstr "如果 `graph_batch_sizes` 为空，则动态初始化图批大小"
-
-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid "`enable_kv_nz`"
 msgstr "`enable_kv_nz`"

-#: ../../user_guide/configuration/additional_config.md
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"Whether to enable kvcache NZ layout. This option only takes effects on "
+"Whether to enable KV cache NZ layout. This option only takes effects on "
 "models using MLA (e.g., DeepSeek)."
-msgstr "是否启用 kvcache NZ 布局。此选项仅对使用 MLA 的模型（例如 DeepSeek）生效。"
+msgstr "是否启用 KV 缓存 NZ 布局。此选项仅对使用 MLA 的模型（例如 DeepSeek）生效。"

-#: ../../user_guide/configuration/additional_config.md:52
-msgid "**ascend_scheduler_config**"
-msgstr "**ascend_scheduler_config**"
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`layer_sharding`"
+msgstr "`layer_sharding`"

-#: ../../user_guide/configuration/additional_config.md
-msgid "Whether to enable ascend scheduler for V1 engine"
-msgstr "是否为 V1 引擎启用 ascend 调度器"
-
-#: ../../user_guide/configuration/additional_config.md:58
+#: ../../source/user_guide/configuration/additional_config.md
 msgid ""
-"ascend_scheduler_config also support the options from [vllm scheduler "
-"config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig)."
-" For example, you can add `enable_chunked_prefill: True` to "
-"ascend_scheduler_config as well."
-msgstr ""
-"ascend_scheduler_config 也支持来自 [vllm scheduler "
-"config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig)"
-" 的选项。例如，你也可以在 ascend_scheduler_config 中添加 `enable_chunked_prefill: True`。"
+"Configuration options for Layer Sharding Linear. In PD-disaggregated "
+"deployments, it is supported only on P nodes with "
+"`kv_role=\"kv_producer\"`."
+msgstr "层分片线性层的配置选项。在 PD 解耦部署中，仅支持在 `kv_role=\"kv_producer\"` 的 P 节点上使用。"

-#: ../../user_guide/configuration/additional_config.md:60
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_sparse_c8`"
+msgstr "`enable_sparse_c8`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable KV cache C8 in DSA models (e.g., DeepSeekV3.2 and "
+"GLM5). Not supported on A5 devices now"
+msgstr "是否在 DSA 模型（例如 DeepSeekV3.2 和 GLM5）中启用 KV 缓存 C8。目前 A5 设备不支持。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_mc2_hierarchy_comm`"
+msgstr "`enable_mc2_hierarchy_comm`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Enable dispatch/combine op inter-node communication by ROCE."
+msgstr "通过 ROCE 启用分发/组合算子的节点间通信。"
+
+#: ../../source/user_guide/configuration/additional_config.md:50
+msgid "The details of each configuration option are as follows:"
+msgstr "每个配置选项的详细信息如下："
+
+#: ../../source/user_guide/configuration/additional_config.md:52
+msgid "**xlite_graph_config**"
+msgstr "**xlite_graph_config**"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enabled`"
+msgstr "`enabled`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable Xlite graph mode. Currently only Llama, Qwen dense "
+"series models, and Qwen3-VL are supported."
+msgstr "是否启用 Xlite 图模式。目前仅支持 Llama、Qwen 稠密系列模型和 Qwen3-VL。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`full_mode`"
+msgstr "`full_mode`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable Xlite for both the prefill and decode stages. By "
+"default, Xlite is only enabled for the decode stage."
+msgstr "是否在预填充和解码阶段都启用 Xlite。默认情况下，Xlite 仅对解码阶段启用。"
+
+#: ../../source/user_guide/configuration/additional_config.md:59
+msgid "**weight_prefetch_config**"
+msgstr "**weight_prefetch_config**"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable weight prefetch."
+msgstr "是否启用权重预取。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`prefetch_ratio`"
+msgstr "`prefetch_ratio`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+#, python-brace-format
+msgid ""
+"`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, "
+"\"mlp\": { \"gate_up\": 1.0,  \"down\": 1.0}}`"
+msgstr "`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, \"mlp\": { \"gate_up\": 1.0,  \"down\": 1.0}}`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Prefetch ratio of each weight."
+msgstr "各权重的预取比例。"
+
+#: ../../source/user_guide/configuration/additional_config.md:66
+msgid "**finegrained_tp_config**"
+msgstr "**finegrained_tp_config**"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`lmhead_tensor_parallel_size`"
+msgstr "`lmhead_tensor_parallel_size`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`0`"
+msgstr "`0`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The custom tensor parallel size of lm_head."
+msgstr "lm_head 的自定义张量并行大小。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`oproj_tensor_parallel_size`"
+msgstr "`oproj_tensor_parallel_size`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The custom tensor parallel size of o_proj."
+msgstr "o_proj 的自定义张量并行大小。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`embedding_tensor_parallel_size`"
+msgstr "`embedding_tensor_parallel_size`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The custom tensor parallel size of embedding."
+msgstr "embedding 的自定义张量并行大小。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`mlp_tensor_parallel_size`"
+msgstr "`mlp_tensor_parallel_size`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The custom tensor parallel size of mlp."
+msgstr "mlp 的自定义张量并行大小。"
+
+#: ../../source/user_guide/configuration/additional_config.md:75
+msgid "**ascend_compilation_config**"
+msgstr "**ascend_compilation_config**"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable npugraph_ex backend."
+msgstr "是否启用 npugraph_ex 后端。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`enable_static_kernel`"
+msgstr "`enable_static_kernel`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable static kernel. Suitable for scenarios where shape "
+"changes are minimal and some time is available for static kernel "
+"compilation."
+msgstr "是否启用静态内核。适用于形状变化极小且有时间为静态内核编译的场景。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`fuse_norm_quant`"
+msgstr "`fuse_norm_quant`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable fuse_norm_quant pass."
+msgstr "是否启用 fuse_norm_quant 优化过程。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`fuse_qknorm_rope`"
+msgstr "`fuse_qknorm_rope`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable fuse_qknorm_rope pass. If Triton is not in the "
+"environment, set it to False."
+msgstr "是否启用 fuse_qknorm_rope 优化过程。如果环境中没有 Triton，请将其设置为 False。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`fuse_allreduce_rms`"
+msgstr "`fuse_allreduce_rms`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Whether to enable fuse_allreduce_rms pass. It's set to False because of "
+"conflict with SP."
+msgstr "是否启用 fuse_allreduce_rms 优化过程。由于与 SP 冲突，默认设置为 False。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`fuse_muls_add`"
+msgstr "`fuse_muls_add`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable fuse_muls_add pass."
+msgstr "是否启用 fuse_muls_add 优化过程。"
+
+#: ../../source/user_guide/configuration/additional_config.md:86
+msgid "**eplb_config**"
+msgstr "**eplb_config**"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`dynamic_eplb`"
+msgstr "`dynamic_eplb`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Whether to enable dynamic EPLB."
+msgstr "是否启用动态 EPLB。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`expert_map_path`"
+msgstr "`expert_map_path`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"When using expert load balancing for an MoE model, an expert map path "
+"needs to be passed in."
+msgstr "为 MoE 模型使用专家负载均衡时，需要传入专家映射路径。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`expert_heat_collection_interval`"
+msgstr "`expert_heat_collection_interval`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`400`"
+msgstr "`400`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Forward iterations when EPLB begins."
+msgstr "EPLB 开始时的前向传播迭代次数。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`algorithm_execution_interval`"
+msgstr "`algorithm_execution_interval`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`30`"
+msgstr "`30`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "The forward iterations when the EPLB worker will finish CPU tasks."
+msgstr "EPLB 工作进程完成 CPU 任务所需的前向传播迭代次数。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`expert_map_record_path`"
+msgstr "`expert_map_record_path`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid ""
+"Save the expert load calculation results to a new expert table in the "
+"specified directory."
+msgstr "将专家负载计算结果保存到指定目录下的新专家表中。"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "`num_redundant_experts`"
+msgstr "`num_redundant_experts`"
+
+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Specify redundant experts during initialization."
+msgstr "在初始化时指定冗余专家数量。"
+
+#: ../../source/user_guide/configuration/additional_config.md:97
 msgid "Example"
 msgstr "示例"

-#: ../../user_guide/configuration/additional_config.md:62
+#: ../../source/user_guide/configuration/additional_config.md:99
 msgid "An example of additional configuration is as follows:"
 msgstr "以下是额外配置的一个示例："
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/index.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/index.po
@@ -0,0 +1,25 @@
+# Chinese (Simplified) translations for vllm-ascend.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/deployment_guide/index.md:1
+#: ../../source/user_guide/deployment_guide/index.md:3
+msgid "Deployment Guide"
+msgstr "部署指南"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/using_volcano_kthena.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/deployment_guide/using_volcano_kthena.po
@@ -0,0 +1,293 @@
+# Chinese (Simplified) translations for vllm-ascend.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:1
+msgid "Using Volcano Kthena"
+msgstr "使用 Volcano Kthena"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:3
+msgid ""
+"This guide shows how to run **prefill–decode (PD) disaggregation** on "
+"Huawei Ascend NPUs using **vLLM-Ascend**, with "
+"[**Kthena**](https://kthena.volcano.sh/) handling orchestration on "
+"Kubernetes. About vLLM support with Kthena, please refer to [Deploy vLLM "
+"with "
+"Kthena](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/)."
+msgstr ""
+"本指南展示了如何在华为昇腾 NPU 上使用 **vLLM-Ascend** 运行**预填充-解码（PD）解耦**，并由 "
+"[**Kthena**](https://kthena.volcano.sh/) 在 Kubernetes 上处理编排。关于 vLLM 与 Kthena 的集成支持，请参阅[使用 "
+"Kthena 部署 vLLM](https://docs.vllm.ai/en/latest/deployment/integrations/kthena/)。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:7
+msgid "1. What is Prefill–Decode Disaggregation?"
+msgstr "1. 什么是预填充-解码解耦？"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:9
+msgid "Large language model inference naturally splits into two phases:"
+msgstr "大语言模型推理自然分为两个阶段："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:11
+msgid "**Prefill**"
+msgstr "**预填充**"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:12
+msgid "Processes input tokens and builds the key–value (KV) cache."
+msgstr "处理输入令牌并构建键值（KV）缓存。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:13
+msgid "Batch-friendly, high-throughput, well-suited to parallel NPU execution."
+msgstr "批处理友好，高吞吐量，非常适合并行 NPU 执行。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:14
+msgid "**Decode**"
+msgstr "**解码**"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:15
+msgid "Consumes the KV cache to generate output tokens."
+msgstr "消耗 KV 缓存以生成输出令牌。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:16
+msgid "Latency-sensitive, memory-intensive, more sequential."
+msgstr "延迟敏感，内存密集型，更具顺序性。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:18
+msgid ""
+"From the client's perspective, this still looks like a single Chat / "
+"Completions endpoint."
+msgstr "从客户端的角度来看，这仍然像一个单一的聊天/补全端点。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:22
+msgid "2. Deploy on Kubernetes with Kthena"
+msgstr "2. 使用 Kthena 在 Kubernetes 上部署"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:24
+msgid ""
+"[Kthena](https://kthena.volcano.sh/) is a Kubernetes-native LLM inference"
+" platform that transforms how organizations deploy and manage Large "
+"Language Models in production. Built with declarative model lifecycle "
+"management and intelligent request routing, it provides high-performance "
+"and enterprise-grade scalability for LLM inference workloads. In this "
+"example, we use three key Custom Resource Definitions (CRDs):"
+msgstr ""
+"[Kthena](https://kthena.volcano.sh/) 是一个 Kubernetes 原生的 LLM 推理平台，它改变了组织在生产环境中部署和管理大语言模型的方式。它基于声明式模型生命周期管理和智能请求路由构建，为 LLM 推理工作负载提供高性能和企业级的可扩展性。在本示例中，我们使用三个关键的自定义资源定义（CRD）："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:26
+msgid "`ModelServing` — defines the workloads (prefill and decode roles)."
+msgstr "`ModelServing` — 定义工作负载（预填充和解码角色）。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:27
+msgid "`ModelServer` — manages PD groupings and internal routing."
+msgstr "`ModelServer` — 管理 PD 分组和内部路由。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:28
+msgid "`ModelRoute` — exposes a stable model endpoint."
+msgstr "`ModelRoute` — 暴露一个稳定的模型端点。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:30
+msgid ""
+"This section uses the `deepseek-ai/DeepSeek-V2-Lite` example, but you can"
+" swap in any model supported by vLLM-Ascend."
+msgstr "本节使用 `deepseek-ai/DeepSeek-V2-Lite` 示例，但您可以替换为 vLLM-Ascend 支持的任何模型。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:32
+msgid "2.1 Prerequisites"
+msgstr "2.1 先决条件"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:34
+msgid "Kubernetes cluster with Ascend NPU nodes:"
+msgstr "包含昇腾 NPU 节点的 Kubernetes 集群："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:36
+msgid ""
+"The resources corresponding to different NPU Drivers may vary slightly. "
+"For example:"
+msgstr "不同 NPU 驱动对应的资源可能略有不同。例如："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:38
+#, python-format
+msgid ""
+"If using [MindCluster](https://gitee.com/ascend/mind-"
+"cluster#https://gitee.com/link?target=https%3A%2F%2Fgitcode.com%2FAscend"
+"%2Fmind-cluster), please use `huawei.com/Ascend310P` or "
+"`huawei.com/Ascend910`."
+msgstr ""
+"如果使用 [MindCluster](https://gitee.com/ascend/mind-"
+"cluster#https://gitee.com/link?target=https%3A%2F%2Fgitcode.com%2FAscend%2Fmind-cluster)，请使用 "
+"`huawei.com/Ascend310P` 或 `huawei.com/Ascend910`。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:40
+msgid ""
+"If running on CCE (Cloud Container Engine) of Huawei Cloud and the [CCE "
+"AI Suite Plugin (Ascend NPU)](https://support.huaweicloud.com/intl/en-us"
+"/usermanual-cce/cce_10_0239.html) is installed, please use "
+"`huawei.com/ascend-310` or `huawei.com/ascend-1980`."
+msgstr ""
+"如果在华为云的 CCE（云容器引擎）上运行并且安装了 [CCE AI 套件插件（昇腾 "
+"NPU）](https://support.huaweicloud.com/intl/en-us/usermanual-cce/cce_10_0239.html)，请使用 "
+"`huawei.com/ascend-310` 或 `huawei.com/ascend-1980`。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:42
+msgid ""
+"Kthena installed. Please follow the [Kthena installation "
+"guide](https://kthena.volcano.sh/docs/getting-started/installation)."
+msgstr "已安装 Kthena。请遵循 [Kthena 安装指南](https://kthena.volcano.sh/docs/getting-started/installation)。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:44
+msgid "2.2 Deploy Prefill-Decode Disaggregated DeepSeek-V2-Lite on Kubernetes"
+msgstr "2.2 在 Kubernetes 上部署预填充-解码解耦的 DeepSeek-V2-Lite"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:46
+msgid ""
+"A concrete example is provided in Kthena as <https://github.com/volcano-"
+"sh/kthena/blob/main/examples/model-serving/prefill-decode-"
+"disaggregation.yaml>"
+msgstr ""
+"Kthena 中提供了一个具体示例：<https://github.com/volcano-sh/kthena/blob/main/examples/model-"
+"serving/prefill-decode-disaggregation.yaml>"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:48
+msgid "Deploy it with the command below:"
+msgstr "使用以下命令部署："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:54
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:315
+msgid "or"
+msgstr "或"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:293
+msgid "You should see Pods such as:"
+msgstr "您应该会看到类似以下的 Pod："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:295
+msgid "`deepseek-v2-lite-0-prefill-0-0`"
+msgstr "`deepseek-v2-lite-0-prefill-0-0`"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:296
+msgid "`deepseek-v2-lite-0-decode-0-0`"
+msgstr "`deepseek-v2-lite-0-decode-0-0`"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:298
+msgid ""
+"To enable the LLM access, we still need to configure the routing layer "
+"with `ModelServer` and `ModelRoute`."
+msgstr "要启用 LLM 访问，我们仍然需要使用 `ModelServer` 和 `ModelRoute` 配置路由层。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:300
+msgid "2.3 ModelServer: PD Group Management"
+msgstr "2.3 ModelServer：PD 分组管理"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:302
+msgid "The `ModelServer` resource:"
+msgstr "`ModelServer` 资源："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:304
+msgid "Selects the `ModelServing` workloads via labels."
+msgstr "通过标签选择 `ModelServing` 工作负载。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:305
+msgid "Groups prefill and decode Pods into PD pairs."
+msgstr "将预填充和解码 Pod 分组为 PD 对。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:306
+msgid "Configures KV connector details and timeouts."
+msgstr "配置 KV 连接器详细信息和超时设置。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:307
+msgid "Exposes an internal gRPC/HTTP interface."
+msgstr "暴露一个内部的 gRPC/HTTP 接口。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:309
+msgid "Create ModelServer with the command below:"
+msgstr "使用以下命令创建 ModelServer："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:345
+msgid "2.4 ModelRoute: User-Facing Endpoint"
+msgstr "2.4 ModelRoute：面向用户的端点"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:347
+msgid ""
+"The `ModelRoute` resource maps a model name (e.g., `\"deepseek-"
+"ai/DeepSeekV2\"`) to the `ModelServer`."
+msgstr "`ModelRoute` 资源将模型名称（例如 `\"deepseek-ai/DeepSeekV2\"`）映射到 `ModelServer`。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:349
+msgid "Example manifest:"
+msgstr "示例清单："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:369
+msgid "3. Verification"
+msgstr "3. 验证"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:371
+msgid "3.1 Check Workloads"
+msgstr "3.1 检查工作负载"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:373
+msgid "Confirm that prefill and decode Pods are up:"
+msgstr "确认预填充和解码 Pod 已启动："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:382
+msgid "You should see both roles in `Running` and `Ready` state."
+msgstr "您应该看到两个角色都处于 `Running` 和 `Ready` 状态。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:384
+msgid "3.2 Test the Chat Endpoint"
+msgstr "3.2 测试聊天端点"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:386
+msgid ""
+"Once routing is configured, you can send a test request to the Kthena-"
+"router:"
+msgstr "路由配置完成后，您可以向 Kthena-router 发送测试请求："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:406
+msgid "A successful JSON response confirms that:"
+msgstr "成功的 JSON 响应确认了："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:408
+msgid "The prefill and decode services are both running on Ascend NPUs."
+msgstr "预填充和解码服务都在昇腾 NPU 上运行。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:409
+msgid "KV transfer between them is working."
+msgstr "它们之间的 KV 传输正常工作。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:410
+msgid "The Kthena routing layer is correctly fronting the vLLM-Ascend plugin."
+msgstr "Kthena 路由层已作为 vLLM-Ascend 插件的前端正常工作。"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:414
+msgid "4. Cleanup"
+msgstr "4. 清理"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:416
+msgid "To remove the deployment:"
+msgstr "要移除部署："
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:431
+msgid "5. Summary"
+msgstr "5. 总结"
+
+#: ../../source/user_guide/deployment_guide/using_volcano_kthena.md:433
+msgid ""
+"For more advanced features, please refer to the [Kthena "
+"website](https://kthena.volcano.sh/)."
+msgstr "有关更多高级功能，请参阅 [Kthena 网站](https://kthena.volcano.sh/)。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po
@@ -0,0 +1,307 @@
+# Chinese (Simplified) translations for vllm-ascend.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:1
+msgid "Fine-Grained Tensor Parallelism (Finegrained TP)"
+msgstr "细粒度张量并行 (Finegrained TP)"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:5
+msgid ""
+"Fine-Grained Tensor Parallelism (Fine-grained TP) extends standard tensor"
+" parallelism by enabling **independent tensor-parallel sizes for "
+"different model components**. Instead of applying a single global "
+"`tensor_parallel_size` to all layers, Fine-grained TP allows users to "
+"configure separate TP sizes for key modules—such as embedding, language "
+"model head (lm_head), attention output projection (o_proj), and MLP "
+"blocks—via the `finegrained_tp_config` parameter."
+msgstr ""
+"细粒度张量并行 (Fine-grained TP) 扩展了标准张量并行，允许为**不同的模型组件设置独立的张量并行规模**。与对所有层应用单一的全局 `tensor_parallel_size` 不同，细粒度 TP 允许用户通过 `finegrained_tp_config` 参数为关键模块（如嵌入层、语言模型头部 (lm_head)、注意力输出投影层 (o_proj) 和 MLP 块）配置独立的 TP 规模。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:7
+msgid ""
+"This capability supports heterogeneous parallelism strategies within a "
+"single model, providing finer control over weight distribution, memory "
+"layout, and communication patterns across devices. The feature is "
+"compatible with standard dense transformer architectures and integrates "
+"seamlessly into vLLM’s serving pipeline."
+msgstr ""
+"此功能支持在单个模型内使用异构并行策略，从而能更精细地控制跨设备的权重分布、内存布局和通信模式。该特性与标准的密集 Transformer 架构兼容，并能无缝集成到 vLLM 的服务流水线中。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:11
+msgid "Benefits of Finegrained TP"
+msgstr "细粒度 TP 的优势"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:13
+msgid ""
+"Fine-Grained Tensor Parallelism delivers two primary performance "
+"advantages through targeted weight sharding:"
+msgstr "细粒度张量并行通过有针对性的权重分片带来两个主要的性能优势："
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:15
+msgid ""
+"**Reduced Per-Device Memory Footprint**:   Fine-grained TP shards large "
+"weight matrices(e.g., LM Head, o_proj)across devices, lowering peak "
+"memory usage and enabling larger batches or deployment on memory-limited "
+"hardware—without quantization."
+msgstr ""
+"**降低单设备内存占用**：   细粒度 TP 将大型权重矩阵（例如 LM Head、o_proj）分片到多个设备上，降低了峰值内存使用量，从而支持更大的批次或在内存受限的硬件上进行部署——无需量化。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:18
+msgid ""
+"**Faster Memory Access in GEMMs**:   In decode-heavy workloads, GEMM "
+"performance is often memory-bound. Weight sharding reduces per-device "
+"weight fetch volume, cutting DRAM traffic and improving bandwidth "
+"efficiency—especially for latency-sensitive layers like LM Head and "
+"o_proj."
+msgstr ""
+"**加速 GEMM 中的内存访问**：   在解码密集型工作负载中，GEMM 性能通常受内存带宽限制。权重分片减少了每个设备需要获取的权重数据量，从而降低了 DRAM 流量并提高了带宽效率——对于 LM Head 和 o_proj 等延迟敏感层尤其如此。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:21
+msgid ""
+"Together, these effects allow practitioners to better balance memory, "
+"communication, and compute—particularly in high-concurrency serving "
+"scenarios—while maintaining compatibility with standard dense transformer"
+" models."
+msgstr "综合来看，这些效果使实践者能够更好地平衡内存、通信和计算——尤其是在高并发服务场景中——同时保持与标准密集 Transformer 模型的兼容性。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:25
+msgid "Supported Scenarios"
+msgstr "支持场景"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:27
+msgid "Models"
+msgstr "模型"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:29
+msgid ""
+"Fine-grained TP is **model-agnostic** and supports all standard dense "
+"transformer architectures, including Llama, Qwen, DeepSeek (base/dense "
+"variants), and others."
+msgstr "细粒度 TP 是**模型无关的**，支持所有标准的密集 Transformer 架构，包括 Llama、Qwen、DeepSeek（基础/密集变体）等。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:31
+msgid "Component & Execution Mode Support"
+msgstr "组件与执行模式支持"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "TP config"
+msgstr "TP 配置"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Eager"
+msgstr "Eager"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Graph"
+msgstr "Graph"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Hybrid"
+msgstr "Hybrid"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Prefill"
+msgstr "Prefill"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Decode"
+msgstr "Decode"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**embedding**"
+msgstr "**embedding**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "✅"
+msgstr "✅"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**o_proj**"
+msgstr "**o_proj**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "❌"
+msgstr "❌"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**mlp**"
+msgstr "**mlp**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**LMhead**"
+msgstr "**LMhead**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:40
+msgid "⚠️ Note:"
+msgstr "⚠️ 注意："
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:42
+msgid ""
+"`o_proj` TP is only supported in Graph mode during Decode, because "
+"dummy_run in eager mode will not trigger o_proj."
+msgstr "`o_proj` TP 仅在 Decode 阶段的 Graph 模式下受支持，因为 eager 模式下的 dummy_run 不会触发 o_proj。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:43
+msgid ""
+"`mlp` TP supports dense models, or dense layers in MoE models. For "
+"example, the first three dense layers of DeepSeek-R1."
+msgstr "`mlp` TP 支持密集模型，或 MoE 模型中的密集层。例如，DeepSeek-R1 的前三个密集层。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:45
+msgid "Configuration Limit"
+msgstr "配置限制"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:47
+msgid "The Fine-Grained TP size for any component must:"
+msgstr "任何组件的细粒度 TP 规模必须满足："
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:49
+msgid "Be **≤ the data-parallel (DP) size**, and"
+msgstr "**≤ 数据并行 (DP) 规模**，并且"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:50
+msgid ""
+"**Evenly divide the DP size** (i.e., `dp_size % tp_size == 0`) to ensure "
+"valid device assignment and communication grouping."
+msgstr "**能整除 DP 规模**（即 `dp_size % tp_size == 0`），以确保有效的设备分配和通信分组。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:52
+msgid ""
+"⚠️ Violating these constraints will result in runtime errors or undefined"
+" behavior."
+msgstr "⚠️ 违反这些约束将导致运行时错误或未定义行为。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:56
+msgid "How to Use Finegrained TP"
+msgstr "如何使用细粒度 TP"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:58
+msgid "Configuration Format"
+msgstr "配置格式"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:60
+msgid ""
+"Fine-grained TP is controlled via the `finegrained_tp_config` field "
+"inside `--additional-config`."
+msgstr "细粒度 TP 通过 `--additional-config` 内的 `finegrained_tp_config` 字段控制。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:73
+msgid "Example Usage"
+msgstr "使用示例"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:91
+msgid "Experimental Results"
+msgstr "实验结果"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:93
+msgid ""
+"To evaluate the effectiveness of fine-grained TP in large-scale service "
+"scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated "
+"decode instances in an environment of 32 cards Ascend 910B*64G (A2), with"
+" parallel configuration as DP32+EP32, and fine-grained TP size of 8; the "
+"performance data is as follows."
+msgstr "为评估细粒度 TP 在大规模服务场景中的有效性，我们使用模型 **DeepSeek-R1-W8A8**，在 32 卡 Ascend 910B*64G (A2) 环境中部署 PD 分离的解码实例，并行配置为 DP32+EP32，细粒度 TP 规模为 8；性能数据如下。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Module"
+msgstr "模块"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Memory Savings"
+msgstr "内存节省"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "TPOT Impact (batch=24)"
+msgstr "TPOT 影响 (batch=24)"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "o_proj TP = 8"
+msgstr "o_proj TP = 8"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "5.8 GB"
+msgstr "5.8 GB"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**+1.5 ms** (degradation)"
+msgstr "**+1.5 ms** (性能下降)"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "LM head TP = 8"
+msgstr "LM head TP = 8"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "1.51 GB"
+msgstr "1.51 GB"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**−1.2 ms** (improvement)"
+msgstr "**−1.2 ms** (性能提升)"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "FFN TP = 8"
+msgstr "FFN TP = 8"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "0.9 GB"
+msgstr "0.9 GB"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**−1.0 ms** (improvement)"
+msgstr "**−1.0 ms** (性能提升)"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "Embedding TP = 8"
+msgstr "Embedding TP = 8"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**Total**"
+msgstr "**总计**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "**9.72 GB**"
+msgstr "**9.72 GB**"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
+msgid "—"
+msgstr "—"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:103
+msgid ""
+"We achieved significant gains in terms of high memory capacity on a "
+"single card, as well as the benefits of TPOT."
+msgstr "我们在单卡可用内存容量和 TPOT 两方面均取得了显著收益。"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:107
+msgid "✅ Deployment Recommendations"
+msgstr "✅ 部署建议"
+
+#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:109
+msgid ""
+"Fine-grained TP is the **most effective** in the **decode instance** of "
+"PD separation, where models are typically deployed in all-DP mode. In "
+"this setup, sharding weight-heavy layers reduces redundant storage and "
+"memory pressure."
+msgstr "细粒度 TP 在 PD 分离的**解码实例**中**最有效**，该场景下模型通常以全 DP 模式部署。在此设置中，对权重密集的层进行分片可以减少冗余存储和内存压力。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Multi_Token_Prediction.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Multi_Token_Prediction.po
@@ -0,0 +1,233 @@
+# Chinese (Simplified) translations for vllm-ascend.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:1
+msgid "Multi Token Prediction (MTP)"
+msgstr "多令牌预测 (MTP)"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:3
+msgid "Why We Need MTP"
+msgstr "为何需要 MTP"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:5
+msgid ""
+"MTP boosts inference performance by parallelizing the prediction of "
+"multiple tokens, shifting from single-token to multi-token generation. "
+"This approach significantly increases generation throughput and achieves "
+"multiplicative acceleration in inference speed—all without compromising "
+"output quality."
+msgstr ""
+"MTP 通过并行预测多个令牌来提升推理性能，从单令牌生成转向多令牌生成。这种方法显著提高了生成吞吐量，并在不牺牲输出质量的前提下，实现了推理速度的倍增加速。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:7
+msgid "How to Use MTP"
+msgstr "如何使用 MTP"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:9
+msgid ""
+"To enable MTP for DeepSeek-V3 models, add the following parameter when "
+"starting the service:"
+msgstr "要为 DeepSeek-V3 模型启用 MTP，请在启动服务时添加以下参数："
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:11
+#, python-brace-format
+msgid ""
+"--speculative_config ' {\"method\": \"mtp\", \"num_speculative_tokens\": "
+"1, \"disable_padded_drafter_batch\": False} '"
+msgstr ""
+"--speculative_config ' {\"method\": \"mtp\", \"num_speculative_tokens\": "
+"1, \"disable_padded_drafter_batch\": False} '"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:13
+msgid ""
+"`num_speculative_tokens`: The number of speculative tokens that enables "
+"the model to predict multiple tokens at once, if provided. It will "
+"default to the number in the draft model config if present, otherwise, it"
+" is required."
+msgstr ""
+"`num_speculative_tokens`：推测性令牌的数量，如果提供，则使模型能够一次预测多个令牌。如果草稿模型配置中存在此值，则默认使用该值，否则必须提供。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:14
+msgid ""
+"`disable_padded_drafter_batch`: Disable input padding for speculative "
+"decoding. If set to True, speculative input batches can contain sequences"
+" of different lengths, which may only be supported by certain attention "
+"backends. This currently only affects the MTP method of speculation, "
+"default is False."
+msgstr ""
+"`disable_padded_drafter_batch`：禁用推测解码的输入填充。如果设置为 True，推测输入批次可以包含不同长度的序列，这可能仅受某些注意力后端支持。目前这仅影响 MTP 推测方法，默认值为 False。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:16
+msgid "How It Works"
+msgstr "工作原理"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:18
+msgid "Module Architecture"
+msgstr "模块架构"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:29
+msgid "**1. sample**"
+msgstr "**1. 采样**"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:31
+msgid ""
+"*rejection_sample.py*: During decoding, the main model processes the "
+"previous round’s output token and the predicted token together (computing"
+" 1+k tokens simultaneously). The first token is always correct, while the"
+" second token—referred to as the **bonus token**—is uncertain since it is"
+" derived from speculative prediction, thus we employ **Greedy Strategy** "
+"and **Rejection Sampling Strategy** to determine whether the bonus token "
+"should be accepted. The module structure consists of an "
+"`AscendRejectionSampler` class with a forward method that implements the "
+"specific sampling logic."
+msgstr ""
+"*rejection_sample.py*：在解码过程中，主模型同时处理上一轮的输出令牌和预测的令牌（同时计算 1+k 个令牌）。第一个令牌总是正确的，而第二个令牌（称为**奖励令牌**）则不确定，因为它源自推测性预测，因此我们采用**贪婪策略**和**拒绝采样策略**来决定是否应接受该奖励令牌。该模块结构包含一个 `AscendRejectionSampler` 类，其 forward 方法实现了具体的采样逻辑。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:39
+msgid "**2. spec_decode**"
+msgstr "**2. spec_decode**"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:41
+msgid ""
+"This section encompasses the model preprocessing for spec-decode, "
+"primarily structured as follows: it includes loading the model, executing"
+" a dummy run, and generating token IDs. These steps collectively form the"
+" model data construction and forward invocation for a single spec-decode "
+"operation."
+msgstr "本节涵盖 spec-decode 的模型预处理，主要包括加载模型、执行虚拟运行（dummy run）以及生成令牌 ID。这些步骤共同构成了单次 spec-decode 操作的模型数据构建和前向调用。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:43
+msgid ""
+"*mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding "
+"where proposals are generated by DeepSeek MTP layer."
+msgstr "*mtp_proposer.py*：配置 vLLM-Ascend 使用推测解码，其中提议由 DeepSeek MTP 层生成。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:54
+msgid "Algorithm"
+msgstr "算法"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:56
+msgid "**1. Rejection Sampling**"
+msgstr "**1. 拒绝采样**"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:58
+msgid "*Greedy Strategy*"
+msgstr "*贪婪策略*"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:60
+msgid ""
+"Verify whether the token generated by the main model matches the "
+"speculative token predicted by MTP in the previous round. If they match "
+"exactly, accept the bonus token; otherwise, reject it and any subsequent "
+"tokens derived from that speculation."
+msgstr "验证主模型生成的令牌是否与上一轮 MTP 预测的推测令牌匹配。如果完全匹配，则接受奖励令牌；否则，拒绝该令牌以及源自该推测的任何后续令牌。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:62
+msgid "*Rejection Sampling Strategy*"
+msgstr "*拒绝采样策略*"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:64
+msgid "This method introduces stochasticity in rejection sampling."
+msgstr "此方法在拒绝采样中引入了随机性。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:66
+msgid ""
+"For each draft token, acceptance is determined by verifying whether the "
+"inequality `P_target / P_draft ≥ U` holds, where `P_target` represents "
+"the probability assigned to the current draft token by the target model, "
+"`P_draft` denotes the probability assigned by the draft model, and `U` is"
+" a random number sampled uniformly from the interval [0, 1)."
+msgstr "对于每个草稿令牌，通过验证不等式 `P_target / P_draft ≥ U` 是否成立来决定是否接受，其中 `P_target` 表示目标模型分配给当前草稿令牌的概率，`P_draft` 表示草稿模型分配的概率，`U` 是从区间 [0, 1) 均匀采样的随机数。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:68
+msgid ""
+"The decision logic for each draft token is as follows: if the inequality "
+"`P_target / P_draft ≥ U` holds, the draft token is accepted as output; "
+"conversely, if `P_target / P_draft < U`, the draft token is rejected."
+msgstr "每个草稿令牌的决策逻辑如下：如果不等式 `P_target / P_draft ≥ U` 成立，则草稿令牌被接受作为输出；反之，如果 `P_target / P_draft < U`，则草稿令牌被拒绝。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:70
+msgid ""
+"When a draft token is rejected, a recovery sampling process is triggered "
+"where a \"recovered token\" is resampled from the adjusted probability "
+"distribution defined as `Q = max(P_target - P_draft, 0)`. In the current "
+"MTP implementation, since `P_draft` is not provided and defaults to 1, "
+"the formulas simplify such that token acceptance occurs when `P_target ≥ "
+"U` and the recovery distribution becomes `Q = max(P_target - 1, 0)`."
+msgstr "当草稿令牌被拒绝时，会触发恢复采样过程，从调整后的概率分布 `Q = max(P_target - P_draft, 0)` 中重新采样一个“恢复令牌”。在当前 MTP 实现中，由于未提供 `P_draft` 且默认为 1，公式简化为：当 `P_target ≥ U` 时令牌被接受，恢复分布变为 `Q = max(P_target - 1, 0)`。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:72
+msgid "**2. Performance**"
+msgstr "**2. 性能**"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:74
+msgid ""
+"If the bonus token is accepted, the MTP model performs inference for "
+"(num_speculative + 1) tokens, including original main model output token "
+"and bonus token. If rejected, inference is performed for fewer tokens, "
+"depending on how many tokens are accepted."
+msgstr "如果奖励令牌被接受，MTP 模型将对 (num_speculative + 1) 个令牌执行推理，包括原始主模型输出令牌和奖励令牌。如果被拒绝，则推理的令牌数会相应减少，具体取决于被接受的令牌数量。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:76
+msgid "DFX"
+msgstr "DFX"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:78
+msgid "Method Validation"
+msgstr "方法验证"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:80
+msgid ""
+"Currently, the spec_decode scenario only supports methods such as n-gram,"
+" EAGLE, EAGLE3, and MTP. If an incorrect parameter is passed for the "
+"method, the code will raise an error to alert the user that an incorrect "
+"method was provided."
+msgstr "目前，spec_decode 场景仅支持 n-gram、EAGLE、EAGLE3 和 MTP 等方法。如果 method 参数传入了不正确的值，代码将引发错误，提示用户所提供的方法不正确。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:98
+msgid "Integer Validation"
+msgstr "整数验证"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:100
+msgid ""
+"The current npu_fused_infer_attention_score operator only supports "
+"integers less than 16 per decode round. Therefore, the maximum supported "
+"value for MTP is 15. If a value greater than 15 is provided, the code "
+"will raise an error and alert the user."
+msgstr "当前的 npu_fused_infer_attention_score 算子每轮解码仅支持小于 16 的整数。因此，MTP 支持的最大值为 15。如果提供了大于 15 的值，代码将引发错误并提醒用户。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:111
+msgid "Limitations"
+msgstr "限制"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:113
+msgid ""
+"Due to the fact that only a single layer of weights is exposed in "
+"DeepSeek's MTP, the accuracy and performance are not effectively "
+"guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due"
+" to current operator limitations, MTP supports a maximum of 15."
+msgstr "由于 DeepSeek 的 MTP 仅暴露了单层权重，因此在 MTP > 1（尤其是 MTP ≥ 3）的场景下，准确性和性能无法得到有效保证。此外，由于当前算子限制，MTP 最多支持 15。"
+
+#: ../../source/user_guide/feature_guide/Multi_Token_Prediction.md:114
+msgid ""
+"In the fullgraph mode with MTP > 1, the capture size of each ACLGraph "
+"must be an integer multiple of (num_speculative_tokens + 1)."
+msgstr "在 MTP > 1 的 fullgraph 模式下，每个 ACLGraph 的捕获大小必须是 (num_speculative_tokens + 1) 的整数倍。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po
@@ -0,0 +1,214 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:1
+msgid "Batch Invariance"
+msgstr "批次不变性"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:4
+msgid ""
+"Batch invariance is currently in beta. Some features are still under "
+"active development. Track progress and planned improvements at "
+"<https://github.com/vllm-project/vllm-ascend/issues/5487>"
+msgstr ""
+"批次不变性功能目前处于测试阶段。部分功能仍在积极开发中。请通过 "
+"<https://github.com/vllm-project/vllm-ascend/issues/5487> 跟踪进展和计划改进。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:8
+msgid ""
+"This document shows how to enable batch invariance in vLLM-Ascend. Batch "
+"invariance ensures that the output of a model is deterministic and "
+"independent of the batch size or the order of requests in a batch."
+msgstr ""
+"本文档介绍如何在 vLLM-Ascend 中启用批次不变性。批次不变性确保模型的输出是确定性的，且不依赖于批次大小或批次中请求的顺序。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:10
+msgid "Motivation"
+msgstr "动机"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:12
+msgid "Batch invariance is crucial for several use cases:"
+msgstr "批次不变性对于以下几个用例至关重要："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:14
+msgid ""
+"**Framework debugging**: Deterministic outputs make it easier to debug "
+"issues in the inference framework, as the same input will always produce "
+"the same output regardless of batching."
+msgstr ""
+"**框架调试**：确定性输出使得调试推理框架中的问题更加容易，因为无论批处理方式如何，相同的输入总是产生相同的输出。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:15
+msgid ""
+"**Model debugging**: Helps identify issues in model implementations by "
+"ensuring consistent behavior across different batch configurations."
+msgstr "**模型调试**：通过确保在不同批次配置下行为一致，帮助识别模型实现中的问题。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:16
+msgid ""
+"**Reinforcement Learning (RL)**: RL training often requires deterministic"
+" rollouts for reproducibility and stable training."
+msgstr "**强化学习 (RL)**：RL 训练通常需要确定性的推演过程，以确保可复现性和稳定的训练。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:17
+msgid ""
+"**Large-scale inference systems**: Systems that use vLLM as a component "
+"benefit from deterministic behavior for testing, validation, and "
+"consistency guarantees."
+msgstr "**大规模推理系统**：将 vLLM 作为组件的系统受益于确定性行为，便于测试、验证和保证一致性。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:19
+msgid "Hardware Requirements"
+msgstr "硬件要求"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:21
+msgid ""
+"Batch invariance currently requires Ascend 910B NPUs, because only the "
+"910B supports batch invariance with HCCL communication for now. We will "
+"support other NPUs in the future."
+msgstr ""
+"批次不变性目前需要 Ascend 910B NPU，因为目前只有 910B 支持通过 HCCL 通信实现批次不变性。我们未来将支持其他 NPU。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:24
+msgid "Software Requirements"
+msgstr "软件要求"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:26
+msgid ""
+"Batch invariance requires a customed operator library for 910B. We will "
+"release the customed operator library in future versions."
+msgstr "批次不变性需要为 910B 定制的算子库。我们将在未来版本中发布该定制算子库。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:29
+msgid "Enabling Batch Invariance"
+msgstr "启用批次不变性"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:31
+msgid ""
+"Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` "
+"environment variable to `1`:"
+msgstr "可以通过将环境变量 `VLLM_BATCH_INVARIANT` 设置为 `1` 来启用批次不变性："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:37
+msgid "Online Inference (Server Mode)"
+msgstr "在线推理（服务器模式）"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:39
+msgid "To start a vLLM server with batch invariance enabled:"
+msgstr "要启动一个启用了批次不变性的 vLLM 服务器："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:45
+msgid "Then use the OpenAI-compatible client:"
+msgstr "然后使用 OpenAI 兼容的客户端："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:68
+msgid "Offline Inference"
+msgstr "离线推理"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:70
+msgid "For offline batch inference with batch invariance:"
+msgstr "对于启用批次不变性的离线批处理推理："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:105
+msgid "Tested Models"
+msgstr "已测试模型"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:107
+msgid "Batch invariance has been tested and verified on the following models:"
+msgstr "批次不变性已在以下模型上经过测试和验证："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:109
+msgid "**Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`"
+msgstr "**Qwen3 (稠密模型)**：`Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:110
+msgid "**Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`"
+msgstr "**Qwen3 (MoE 模型)**：`Qwen/Qwen3-30B-A3B`"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:112
+msgid ""
+"Other models may also work, but these have been explicitly validated. If "
+"you encounter issues with a specific model, please report them on the "
+"[GitHub issue tracker](https://github.com/vllm-project/vllm-"
+"ascend/issues/new/choose)."
+msgstr ""
+"其他模型也可能适用，但上述模型已明确经过验证。如果您在使用特定模型时遇到问题，请在 [GitHub 问题跟踪器](https://github.com/vllm-project/vllm-ascend/issues/new/choose) 上报告。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:114
+msgid "Implementation Details"
+msgstr "实现细节"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:116
+msgid "When batch invariance is enabled, vLLM:"
+msgstr "当启用批次不变性时，vLLM 会："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:118
+msgid ""
+"Uses deterministic kernel implementations for attention and other "
+"operations"
+msgstr "对注意力机制和其他操作使用确定性的内核实现"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:119
+msgid "Ensures consistent numerical behavior across different batch sizes"
+msgstr "确保在不同批次大小下具有一致的数值行为"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:120
+msgid "Disables certain optimizations that may introduce non-determinism"
+msgstr "禁用某些可能引入非确定性的优化"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:123
+msgid ""
+"Enabling batch invariance may impact performance compared to the default "
+"non-deterministic mode. This trade-off is intentional to guarantee "
+"reproducibility."
+msgstr "与默认的非确定性模式相比，启用批次不变性可能会影响性能。这种权衡是为了保证可复现性而有意为之。"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:126
+msgid "Future Improvements"
+msgstr "未来改进"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:128
+msgid ""
+"The batch invariance feature is under active development. Planned "
+"improvements include:"
+msgstr "批次不变性功能正在积极开发中。计划的改进包括："
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:130
+msgid "Support for additional NPUs series"
+msgstr "支持更多 NPU 系列"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:131
+msgid "Expanded model coverage"
+msgstr "扩大模型覆盖范围"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:132
+msgid "Performance optimizations"
+msgstr "性能优化"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:133
+msgid "Additional testing and validation"
+msgstr "额外的测试和验证"
+
+#: ../../source/user_guide/feature_guide/batch_invariance.md:135
+msgid ""
+"For the latest status and to contribute ideas, see the [tracking "
+"issue](https://github.com/vllm-project/vllm-ascend/issues/5487)."
+msgstr "如需了解最新状态或贡献想法，请参阅 [跟踪问题](https://github.com/vllm-project/vllm-ascend/issues/5487)。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po
@@ -0,0 +1,299 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:1
+msgid "Context Parallel Guide"
+msgstr "上下文并行指南"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:5
+msgid ""
+"This guide shows how to use Context Parallel, a long sequence inference "
+"optimization technique. Context Parallel includes `PCP` (Prefill Context "
+"Parallel) and `DCP` (Decode Context Parallel), which reduces NPU memory "
+"usage and improves inference speed in long sequence LLM inference."
+msgstr ""
+"本指南介绍如何使用上下文并行（Context Parallel），一种长序列推理优化技术。上下文并行包括 `PCP`（预填充上下文并行）和 `DCP`（解码上下文并行），可减少长序列LLM推理中的NPU内存使用并提升推理速度。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:7
+msgid "Benefits of Context Parallel"
+msgstr "上下文并行的优势"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:9
+msgid ""
+"Context parallel mainly solves the problem of serving long context "
+"requests. As prefill and decode present quite different characteristics "
+"and have quite different SLO (service level objectives), we need to "
+"implement context parallel separately for them. The major considerations "
+"are:"
+msgstr ""
+"上下文并行主要解决服务长上下文请求的问题。由于预填充和解码阶段具有截然不同的特性以及不同的服务级别目标（SLO），我们需要分别为它们实现上下文并行。主要考虑点如下："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:11
+msgid ""
+"For long context prefill, we can use context parallel to reduce TTFT "
+"(time to first token) by amortizing the computation time of the prefill "
+"across query tokens."
+msgstr ""
+"对于长上下文预填充，我们可以使用上下文并行，通过将预填充的计算时间分摊到查询令牌上，从而减少首令牌时间（TTFT）。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:12
+msgid ""
+"For long context decode, we can use context parallel to reduce KV cache "
+"duplication and offer more space for KV cache to increase the batch size "
+"(and hence the throughput)."
+msgstr ""
+"对于长上下文解码，我们可以使用上下文并行来减少KV缓存的重复存储，为KV缓存提供更多空间，从而增加批处理大小（进而提升吞吐量）。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:14
+msgid ""
+"To learn more about the theory and implementation details of context "
+"parallel, please refer to the [context parallel developer "
+"guide](../../developer_guide/Design_Documents/context_parallel.md)."
+msgstr ""
+"要了解更多关于上下文并行的理论和实现细节，请参阅[上下文并行开发者指南](../../developer_guide/Design_Documents/context_parallel.md)。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:16
+msgid "Supported Scenarios"
+msgstr "支持场景"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:18
+msgid ""
+"Currently context parallel can be used together with most other features,"
+" supported features are as follows:"
+msgstr "目前上下文并行可与大多数其他功能结合使用，支持的功能如下："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Eager"
+msgstr "Eager模式"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Graph"
+msgstr "Graph模式"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Prefix <br> Cache"
+msgstr "前缀<br>缓存"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Chunked <br> Prefill"
+msgstr "分块<br>预填充"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "SpecDecode <br> (MTP)"
+msgstr "推测解码<br>（MTP）"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "PD <br> disaggregation"
+msgstr "PD<br>解耦"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "MLAPO"
+msgstr "MLAPO"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "**PCP**"
+msgstr "**PCP**"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "✅"
+msgstr "✅"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "**DCP**"
+msgstr "**DCP**"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:25
+msgid "How to use Context Parallel"
+msgstr "如何使用上下文并行"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:27
+msgid ""
+"You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and "
+"`decode_context_parallel_size`, refer to the following example:"
+msgstr "您可以通过 `prefill_context_parallel_size` 和 `decode_context_parallel_size` 启用 `PCP` 和 `DCP`，请参考以下示例："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:29
+msgid "Offline example:"
+msgstr "离线示例："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:48
+msgid "Online example:"
+msgstr "在线示例："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:57
+msgid ""
+"The total world size is `tensor_parallel_size` * "
+"`prefill_context_parallel_size`, so the examples above need 4 NPUs for "
+"each."
+msgstr "总的 world size 为 `tensor_parallel_size` * `prefill_context_parallel_size`，因此上述示例各需要 4 个 NPU。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:59
+msgid "Constraints"
+msgstr "约束条件"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:61
+msgid "While using DCP, the following constraints must be met:"
+msgstr "使用DCP时，必须满足以下约束条件："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:62
+msgid "For MLA-based model, such as DeepSeek-R1:"
+msgstr "对于基于MLA的模型，例如DeepSeek-R1："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:63
+msgid "`tensor_parallel_size >= decode_context_parallel_size`"
+msgstr "`tensor_parallel_size >= decode_context_parallel_size`"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:64
+#, python-format
+msgid "`tensor_parallel_size % decode_context_parallel_size == 0`"
+msgstr "`tensor_parallel_size % decode_context_parallel_size == 0`"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:65
+msgid "For GQA-based model, such as Qwen3-235B:"
+msgstr "对于基于GQA的模型，例如Qwen3-235B："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:66
+msgid ""
+"`(tensor_parallel_size // num_key_value_heads) >= "
+"decode_context_parallel_size`"
+msgstr "`(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size`"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:67
+#, python-format
+msgid ""
+"`(tensor_parallel_size // num_key_value_heads) % "
+"decode_context_parallel_size == 0`"
+msgstr "`(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0`"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:69
+msgid ""
+"While using Context Parallel in KV cache transfer-needed scenario (e.g. "
+"KV pooling, PD disaggregation), to simplify KV cache transmission, "
+"`cp_kv_cache_interleave_size` must be set to the same value of KV cache "
+"`block_size`(default: 128), which specifies CP to split KV cache in a "
+"block-interleave style. For example:"
+msgstr ""
+"在需要KV缓存传输的场景（例如KV池化、PD解耦）中使用上下文并行时，为简化KV缓存传输，必须将 `cp_kv_cache_interleave_size` 设置为与KV缓存 `block_size`（默认：128）相同的值，这指定了CP以块交错方式分割KV缓存。例如："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:80
+msgid "Experimental Results"
+msgstr "实验结果"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:82
+msgid ""
+"To evaluate the effectiveness of Context Parallel in long sequence LLM "
+"inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, "
+"deploy PD disaggregate instances in the environment of 64 cards Ascend "
+"910C*64G (A3), the configuration and performance data are as follows."
+msgstr ""
+"为评估上下文并行在长序列LLM推理场景中的有效性，我们使用 **DeepSeek-R1-W8A8** 和 **Qwen3-235B**，在64卡Ascend 910C*64G（A3）环境中部署PD解耦实例，配置和性能数据如下。"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:84
+msgid "DeepSeek-R1-W8A8:"
+msgstr "DeepSeek-R1-W8A8："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Configuration"
+msgstr "配置"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Input length <br> 32k"
+msgstr "输入长度<br> 32k"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Input length <br> 64k"
+msgstr "输入长度<br> 64k"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Input length <br> 128k"
+msgstr "输入长度<br> 128k"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1"
+msgstr "P节点: (DP2 TP8 EP16) *2 <br> D节点: (DP32 EP32)*1"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 9.3s <br> TPOT: 72ms"
+msgstr "TTFT: 9.3s <br> TPOT: 72ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 22.8s <br> TPOT: 74ms"
+msgstr "TTFT: 22.8s <br> TPOT: 74ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 73.2s <br> TPOT: 82ms"
+msgstr "TTFT: 73.2s <br> TPOT: 82ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "P node: (PCP2 TP8 DCP8 EP16) *2 <br> D node: (DP32 EP32)*1"
+msgstr "P节点: (PCP2 TP8 DCP8 EP16) *2 <br> D节点: (DP32 EP32)*1"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 7.9s <br> TPOT: 74ms"
+msgstr "TTFT: 7.9s <br> TPOT: 74ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 15.9s <br> TPOT: 78ms"
+msgstr "TTFT: 15.9s <br> TPOT: 78ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 46.0s <br> TPOT: 83ms"
+msgstr "TTFT: 46.0s <br> TPOT: 83ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md:91
+msgid "Qwen3-235B:"
+msgstr "Qwen3-235B："
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "Input length <br> 120k"
+msgstr "输入长度<br> 120k"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 5.1s <br> TPOT: 65ms"
+msgstr "TTFT: 5.1s <br> TPOT: 65ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 13.1s <br> TPOT: 85ms"
+msgstr "TTFT: 13.1s <br> TPOT: 85ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 33.9s <br> TPOT: 120ms"
+msgstr "TTFT: 33.9s <br> TPOT: 120ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "P node: (PCP2 TP8 DCP2 EP16) *2 <br> D node: (DP32 EP32)*1"
+msgstr "P节点: (PCP2 TP8 DCP2 EP16) *2 <br> D节点: (DP32 EP32)*1"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 3.0s <br> TPOT: 66ms"
+msgstr "TTFT: 3.0s <br> TPOT: 66ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 8.9s <br> TPOT: 86ms"
+msgstr "TTFT: 8.9s <br> TPOT: 86ms"
+
+#: ../../source/user_guide/feature_guide/context_parallel.md
+msgid "TTFT: 22.7s <br> TPOT: 121ms"
+msgstr "TTFT: 22.7s <br> TPOT: 121ms"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po
@@ -0,0 +1,284 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:1
+msgid "CPU Binding"
+msgstr "CPU 绑定"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:5
+msgid ""
+"CPU Binding is a performance optimization feature for vLLM, specifically "
+"designed for servers equipped with **ARM architecture and Ascend NPUs**. "
+"It pins vLLM processes and threads to specific CPU cores to reduce "
+"CPU–NPU cross‑NUMA communication overhead and stabilize inference "
+"latency. This feature only adjusts host-side CPU affinity policies and "
+"**does not alter model execution logic or impact inference results**."
+msgstr ""
+"CPU 绑定是 vLLM 的一项性能优化功能，专为配备 **ARM 架构和昇腾 NPU** 的服务器设计。它将 vLLM 进程和线程固定到特定的 CPU 核心，以减少 CPU-NPU 跨 NUMA 通信开销并稳定推理延迟。此功能仅调整主机端的 CPU 亲和性策略，**不会改变模型执行逻辑或影响推理结果**。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:7
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:9
+msgid "Online serving example with CPU binding enabled (by default)"
+msgstr "启用 CPU 绑定的在线服务示例（默认启用）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:16
+msgid "Online serving example with CPU binding disabled"
+msgstr "禁用 CPU 绑定的在线服务示例"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:23
+msgid "Offline inference example with CPU binding enabled"
+msgstr "启用 CPU 绑定的离线推理示例"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:34
+msgid "Offline inference example with CPU binding disabled"
+msgstr "禁用 CPU 绑定的离线推理示例"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:45
+msgid "Dependencies"
+msgstr "依赖项"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:47
+msgid "Installation"
+msgstr "安装"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:49
+msgid "Ubuntu/Debian"
+msgstr "Ubuntu/Debian"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:56
+msgid "RHEL/CentOS/Alma/Rocky"
+msgstr "RHEL/CentOS/Alma/Rocky"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:62
+msgid "openEuler"
+msgstr "openEuler"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:68
+msgid "IRQ binding's additional considerations"
+msgstr "IRQ 绑定的额外注意事项"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:70
+msgid ""
+"For best results, if you run inside a docker container, which `systemctl`"
+" is likely unavailable, stop `irqbalance` service on the host manually "
+"before starting vLLM. Also make sure the container has the necessary "
+"permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:"
+msgstr ""
+"为获得最佳效果，如果您在 Docker 容器内运行（容器内可能没有 `systemctl`），请在启动 vLLM 前手动在主机上停止 `irqbalance` 服务。同时确保容器具有写入 `/proc/irq/*/smp_affinity` 以进行 IRQ 绑定所需的权限："
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:72
+msgid "**Stop `irqbalance` service**:"
+msgstr "**停止 `irqbalance` 服务**："
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:74
+msgid ""
+"For example, on Ubuntu system, you can run the following command to stop "
+"irqbalance:"
+msgstr "例如，在 Ubuntu 系统上，您可以运行以下命令来停止 irqbalance："
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:79
+msgid "After you finish the vLLM process, you can restore irqbalance on the host:"
+msgstr "vLLM 进程结束后，您可以在主机上恢复 irqbalance："
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:85
+msgid "**Permissions**:"
+msgstr "**权限**："
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:86
+msgid "Read access to `/proc/self/status` and `/proc/interrupts`"
+msgstr "对 `/proc/self/status` 和 `/proc/interrupts` 的读取权限"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:87
+msgid "Write access to `/proc/irq/*/smp_affinity` for IRQ binding"
+msgstr "对 `/proc/irq/*/smp_affinity` 的写入权限（用于 IRQ 绑定）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:89
+msgid "Common Issues & Troubleshooting"
+msgstr "常见问题与故障排除"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Error/Warning Message"
+msgstr "错误/警告信息"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Core Cause"
+msgstr "核心原因"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Solution"
+msgstr "解决方案"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Can not get running npu info."
+msgstr "Can not get running npu info.（无法获取正在运行的 NPU 信息）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"The npu-smi process table is empty, or the `ASCEND_RT_VISIBLE_DEVICES` "
+"environment variable filters out all NPUs."
+msgstr "npu-smi 进程表为空，或者 `ASCEND_RT_VISIBLE_DEVICES` 环境变量过滤掉了所有 NPU。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"1. Ensure the process is running on visible NPUs; 2. Verify that the "
+"`ASCEND_RT_VISIBLE_DEVICES` value matches the actual logical NPU IDs."
+msgstr "1. 确保进程在可见的 NPU 上运行；2. 验证 `ASCEND_RT_VISIBLE_DEVICES` 的值与实际逻辑 NPU ID 匹配。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Insufficient CPUs for binding..."
+msgstr "Insufficient CPUs for binding...（用于绑定的 CPU 不足）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"The number of CPU cores allocated to each NPU is less than the minimum "
+"requirement of 5."
+msgstr "分配给每个 NPU 的 CPU 核心数少于最低要求 5 个。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "1. Expand the allowed CPU list; 2. Reduce the number of visible NPUs."
+msgstr "1. 扩展允许的 CPU 列表；2. 减少可见 NPU 的数量。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "NPU topo affinity not found..."
+msgstr "NPU topo affinity not found...（未找到 NPU 拓扑亲和性）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "npu-smi is unable to retrieve NPU topology affinity information."
+msgstr "npu-smi 无法检索 NPU 拓扑亲和性信息。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"Verify the integrity of the npu-smi installation and ensure the user has "
+"sufficient execution permissions."
+msgstr "验证 npu-smi 安装的完整性，并确保用户具有足够的执行权限。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid "Bind cpus failed in rankX..."
+msgstr "Bind cpus failed in rankX...（在 rankX 中绑定 CPU 失败）"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"The CPU binding process failed (e.g., taskset is unavailable, or the user"
+" lacks write permissions for /proc/irq)."
+msgstr "CPU 绑定过程失败（例如，taskset 不可用，或用户缺少对 /proc/irq 的写入权限）。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md
+msgid ""
+"1. Confirm that required tools (taskset, lscpu, npu-smi) are installed "
+"and available; 2. Verify the Cpus_allowed_list in `/proc/self/status` is "
+"valid."
+msgstr "1. 确认所需工具（taskset、lscpu、npu-smi）已安装且可用；2. 验证 `/proc/self/status` 中的 Cpus_allowed_list 是否有效。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:98
+msgid "Key Limitations"
+msgstr "主要限制"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:100
+msgid "ARM architecture only: Binding is automatically skipped on x86_64 systems."
+msgstr "仅限 ARM 架构：在 x86_64 系统上会自动跳过绑定。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:102
+msgid ""
+"Symmetric NUMA layout required for optimal performance: CPU numbering "
+"should be aligned with NUMA nodes. Non-symmetric layouts may result in "
+"cross-NUMA CPU pools, reducing locality."
+msgstr "需要对称的 NUMA 布局以获得最佳性能：CPU 编号应与 NUMA 节点对齐。非对称布局可能导致跨 NUMA 的 CPU 池，降低局部性。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:104
+msgid ""
+"IRQ binding requires write permissions for /proc/irq. Memory binding "
+"depends on the `migratepages` tool; if unavailable, memory migration is "
+"skipped."
+msgstr "IRQ 绑定需要对 /proc/irq 的写入权限。内存绑定依赖于 `migratepages` 工具；如果不可用，则跳过内存迁移。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:106
+msgid "FAQ"
+msgstr "常见问题解答"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:108
+msgid "**Q1: Does CPU binding work on x86_64?**"
+msgstr "**Q1: CPU 绑定在 x86_64 上有效吗？**"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:110
+msgid "No. The binding is skipped on non‑ARM CPUs."
+msgstr "否。在非 ARM CPU 上会跳过绑定。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:112
+msgid "**Q2: Why are only the current rank’s IRQs bound?**"
+msgstr "**Q2: 为什么只绑定当前 rank 的 IRQ？**"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:114
+msgid ""
+"To avoid multiple processes overwriting IRQ affinity settings for the "
+"same device."
+msgstr "为了避免多个进程覆盖同一设备的 IRQ 亲和性设置。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:116
+msgid "**Q3: What if my cpuset already limits CPUs?**"
+msgstr "**Q3: 如果我的 cpuset 已经限制了 CPU 怎么办？**"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:118
+msgid ""
+"The binder uses Cpus_allowed_list from /proc/self/status as the only "
+"eligible CPU set. Ensure this list is large enough."
+msgstr "绑定器使用来自 /proc/self/status 的 Cpus_allowed_list 作为唯一符合条件的 CPU 集合。请确保此列表足够大。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:120
+msgid "**Q4: Does CPU binding change model outputs?**"
+msgstr "**Q4: CPU 绑定会改变模型输出吗？**"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:122
+msgid ""
+"No. It only affects host‑side affinity and should not change numerical "
+"results."
+msgstr "不会。它只影响主机端的亲和性，不应改变数值结果。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:126
+msgid "Summary"
+msgstr "总结"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:128
+msgid ""
+"**Core Objective**: Reduce cross‑NUMA communication by pinning vLLM "
+"processes and threads to specific CPU cores, thereby stabilizing "
+"inference latency in Ascend NPU deployments (only applicable to ARM "
+"architectures)."
+msgstr "**核心目标**：通过将 vLLM 进程和线程固定到特定的 CPU 核心来减少跨 NUMA 通信，从而稳定昇腾 NPU 部署中的推理延迟（仅适用于 ARM 架构）。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:130
+msgid ""
+"**Usage**: Enable or disable with `enable_cpu_binding` via "
+"`additional_config` in both online and offline workflows."
+msgstr "**使用方法**：在在线和离线工作流中，通过 `additional_config` 中的 `enable_cpu_binding` 参数启用或禁用。"
+
+#: ../../source/user_guide/feature_guide/cpu_binding.md:132
+msgid ""
+"**Key Limitations**: ARM‑only; relies on symmetric NUMA layouts; binding "
+"fails if the CPU pool has fewer than 5 cores; binding errors trigger a "
+"warning log but do not terminate the process."
+msgstr "**主要限制**：仅限 ARM；依赖于对称的 NUMA 布局；如果 CPU 池少于 5 个核心，绑定会失败；绑定错误会触发警告日志但不会终止进程。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po
@@ -0,0 +1,108 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:1
+msgid "Dynamic Batch"
+msgstr "动态批处理"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:3
+msgid ""
+"Dynamic batch is a technique that dynamically adjusts the chunksize "
+"during each inference iteration within the chunked prefilling strategy "
+"according to the resources and SLO targets, thereby improving the "
+"effective throughput and decreasing the TBT."
+msgstr ""
+"动态批处理是一种技术，它根据资源和SLO目标，在分块预填充策略的每次推理迭代中动态调整块大小，从而提高有效吞吐量并降低TBT。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:5
+msgid ""
+"Dynamic batch is controlled by the value of the "
+"`--SLO_limits_for_dynamic_batch`. Notably, only 910 B3 is supported with "
+"decode token number scales below 2048 so far. Especially, the "
+"improvements are quite obvious on Qwen, Llama models. We are working on "
+"further improvements and this feature will support more XPUs in the "
+"future."
+msgstr ""
+"动态批处理由 `--SLO_limits_for_dynamic_batch` 参数的值控制。值得注意的是，目前仅支持910 B3，且解码token数量规模需低于2048。特别是在Qwen、Llama模型上，改进效果相当明显。我们正在进行进一步的改进，该功能未来将支持更多XPU。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:10
+msgid "Getting started"
+msgstr "快速开始"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:12
+msgid "Prerequisites"
+msgstr "先决条件"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:14
+msgid ""
+"Dynamic batch now depends on an offline cost model saved in a lookup "
+"table to refine the token budget. The lookup table is saved in a '.csv' "
+"file, which should be first downloaded from [A2-B3-BLK128.csv](https"
+"://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-"
+"ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to "
+"the path `vllm_ascend/core/profile_table.csv`"
+msgstr ""
+"动态批处理目前依赖于一个保存在查找表中的离线成本模型来优化token预算。该查找表保存在一个'.csv'文件中，需要先从[A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv)下载，重命名后保存到路径 `vllm_ascend/core/profile_table.csv`。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:16
+msgid ""
+"`Pandas` is needed to load the lookup table, in case pandas is not "
+"installed."
+msgstr "加载查找表需要 `Pandas`；如果尚未安装 pandas，请先安装。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:22
+msgid "Tuning Parameters"
+msgstr "调优参数"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:24
+msgid ""
+"`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) "
+"for the dynamic batch feature, larger values impose more constraints on "
+"the latency limitation, leading to higher effective throughput. The "
+"parameter can be selected according to the specific models or service "
+"requirements."
+msgstr ""
+"`--SLO_limits_for_dynamic_batch` 是动态批处理功能的调优参数（整数类型），较大的值会对延迟限制施加更多约束，从而带来更高的有效吞吐量。可以根据具体模型或服务需求选择该参数。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:32
+msgid "Supported Models"
+msgstr "支持的模型"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:34
+msgid ""
+"So far, dynamic batch performs better on several dense models including "
+"Qwen and Llama (from 8B to 32B) with `tensor_parallel_size=8`. For "
+"different models, a proper `SLO_limits_for_dynamic_batch` parameter is "
+"needed. The empirical value of this parameter is generally `35, 50, or "
+"75`. Therefore, some additional tests are needed to select the best "
+"parameter."
+msgstr ""
+"目前，动态批处理在几个密集模型上表现更好，包括Qwen和Llama（从8B到32B），且 `tensor_parallel_size=8`。对于不同的模型，需要一个合适的 `SLO_limits_for_dynamic_batch` 参数。该参数的经验值通常是 `35、50或75`。因此，需要进行一些额外的测试来选择最佳参数。"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:36
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/dynamic_batch.md:38
+msgid ""
+"Dynamic batch is used in the online inference. A fully executable example"
+" is as follows:"
+msgstr "动态批处理用于在线推理。一个完全可执行的示例如下："
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po
@@ -0,0 +1,237 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:1
+msgid "Disaggregated-encoder"
+msgstr "解耦编码器"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:3
+msgid "Why disaggregated-encoder?"
+msgstr "为何需要解耦编码器？"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:5
+msgid ""
+"A **disaggregated encoder** runs the vision-encoder stage of a multimodal"
+" LLM in a process that is separate from the pre-fill / decoder stage. "
+"Deploying these two stages in independent vLLM instances brings three "
+"practical benefits:"
+msgstr ""
+"**解耦编码器** 将多模态大语言模型的视觉编码器阶段运行在与预填充/解码器阶段分离的进程中。将这两个阶段部署在独立的 vLLM 实例中，带来三个实际好处："
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:7
+msgid "**Independent, fine-grained scaling**"
+msgstr "**独立、细粒度的扩展**"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:9
+msgid ""
+"Vision encoders are lightweight, while language models are orders of "
+"magnitude larger."
+msgstr "视觉编码器是轻量级的，而语言模型则要大几个数量级。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:10
+msgid ""
+"The language model can be parallelised without affecting the encoder "
+"fleet."
+msgstr "语言模型可以并行化，而不影响编码器集群。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:11
+msgid "Encoder nodes can be added or removed independently."
+msgstr "编码器节点可以独立地添加或移除。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:13
+msgid "**Lower time-to-first-token (TTFT)**"
+msgstr "**降低首令牌生成时间 (TTFT)**"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:15
+msgid "Language-only requests bypass the vision encoder entirely."
+msgstr "纯文本请求完全绕过视觉编码器。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:16
+msgid ""
+"Encoder output is injected only at required attention layers, shortening "
+"the pre-fill critical path."
+msgstr "编码器输出仅在所需的注意力层注入，缩短了预填充的关键路径。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:18
+msgid "**Cross-process reuse and caching of encoder outputs**"
+msgstr "**编码器输出的跨进程复用与缓存**"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:20
+msgid "In-process encoders confine reuse to a single worker."
+msgstr "进程内编码器将复用限制在单个工作进程内。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:21
+msgid ""
+"A remote, shared cache lets any worker retrieve existing embeddings, "
+"eliminating redundant computation."
+msgstr "远程共享缓存允许任何工作进程检索现有的嵌入向量，从而消除冗余计算。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:23
+msgid ""
+"Design doc: <https://docs.google.com/document/d"
+"/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>"
+msgstr ""
+"设计文档：<https://docs.google.com/document/d"
+"/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:27
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:29
+msgid ""
+"The current reference pathway is **ExampleConnector**. The ready-to-run "
+"scripts below show the workflow:"
+msgstr "当前的参考实现路径是 **ExampleConnector**。以下开箱即用的脚本展示了工作流程："
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:32
+msgid ""
+"1 Encoder instance + 1 PD instance: "
+"`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
+msgstr ""
+"1 个编码器实例 + 1 个 PD 实例："
+"`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:35
+msgid ""
+"1 Encoder instance + 1 Prefill instance + 1 Decode instance: "
+"`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
+msgstr ""
+"1 个编码器实例 + 1 个预填充实例 + 1 个解码实例："
+"`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:40
+msgid "Development"
+msgstr "开发说明"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:42
+msgid "![alt text](<./images/epd_disaggregation.jpg>)"
+msgstr "![替代文本](<./images/epd_disaggregation.jpg>)"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:42
+msgid "alt text"
+msgstr "替代文本"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:44
+msgid "Disaggregated encoding is implemented by running two parts:"
+msgstr "解耦编码通过运行两个部分来实现："
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:46
+msgid "**Encoder instance** – a vLLM instance to perform vision encoding."
+msgstr "**编码器实例** – 一个执行视觉编码的 vLLM 实例。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:47
+msgid "**Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode."
+msgstr "**预填充/解码 (PD) 实例** – 运行语言预填充和解码。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:48
+msgid ""
+"PD can be in either a single normal instance with (E + PD) or in "
+"disaggregated instances with (E + P + D)"
+msgstr "PD 可以是一个包含 (E + PD) 的单一常规实例，也可以是解耦的 (E + P + D) 实例"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:50
+msgid ""
+"A connector transfers encoder-cache (EC) embeddings from the encoder "
+"instance to the PD instance.   All related code is under "
+"`vllm/distributed/ec_transfer`."
+msgstr ""
+"一个连接器将编码器缓存 (EC) 嵌入向量从编码器实例传输到 PD 实例。所有相关代码位于 `vllm/distributed/ec_transfer` 目录下。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:53
+msgid "Key abstractions"
+msgstr "关键抽象"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:55
+msgid ""
+"**ECConnector** – interface for retrieving EC caches produced by the "
+"encoder."
+msgstr "**ECConnector** – 用于检索编码器生成的 EC 缓存的接口。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:56
+msgid "*Scheduler role* – checks cache existence and schedules loads."
+msgstr "*调度器角色* – 检查缓存是否存在并调度加载。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:57
+msgid "*Worker role* – loads the embeddings into memory."
+msgstr "*工作进程角色* – 将嵌入向量加载到内存中。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:59
+msgid "**EPD Load Balance Proxy** -"
+msgstr "**EPD 负载均衡代理** -"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:60
+msgid ""
+"*Multi-Path Scheduling Strategy* - dynamically diverts the multimodal "
+"request or text requests to the corresponding inference path"
+msgstr "*多路径调度策略* - 动态地将多模态请求或文本请求分流到相应的推理路径"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:61
+msgid ""
+"*Instance-Level Dynamic Load Balancing* -  dispatches multimodal requests"
+" based on a least-loaded strategy, using a priority queue to balance the "
+"active token workload across instances."
+msgstr "*实例级动态负载均衡* - 基于最小负载策略分发多模态请求，使用优先级队列来平衡各实例间的活跃令牌工作负载。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:63
+msgid ""
+"We create the example setup with the **MooncakeLayerwiseConnector** from "
+"`vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py`"
+" and refer to the "
+"`examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py`"
+" to facilitate the kv transfer between P and D. For step-by-step "
+"deployment and configuration of Mooncake, refer to the following guide:"
+"   "
+"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
+msgstr ""
+"我们使用来自 `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` 的 **MooncakeLayerwiseConnector** 创建示例设置，并参考 "
+"`examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` 来促进 P 和 D 之间的 KV 传输。关于 Mooncake 的逐步部署和配置，请参考以下指南："
+"   "
+"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:66
+msgid ""
+"For the PD disaggregation part, when using MooncakeLayerwiseConnector: "
+"The request first enters the Decoder instance,the Decoder triggers a "
+"remote prefill task in reverse via the Metaserver. The Prefill node then "
+"executes inference and pushes KV Cache layer-wise to the Decoder, "
+"overlapping computation with transmission. Once the transfer is complete,"
+" the Decoder seamlessly continues with the subsequent token generation. "
+"`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` "
+"shows the brief idea about the disaggregated prefill."
+msgstr ""
+"对于 PD 解耦部分，当使用 MooncakeLayerwiseConnector 时：请求首先进入解码器实例，解码器通过元服务器反向触发一个远程预填充任务。然后预填充节点执行推理，并将 KV 缓存逐层推送到解码器，实现计算与传输的重叠。一旦传输完成，解码器无缝地继续后续的令牌生成。`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` 展示了关于解耦预填充的简要思路。"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:69
+msgid "Limitations"
+msgstr "限制"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:71
+msgid ""
+"Disable `--mm-processor-cache-gb 0` if you want to use cross-process "
+"caching"
+msgstr "如果要使用跨进程缓存，请禁用 `--mm-processor-cache-gb 0`"
+
+#: ../../source/user_guide/feature_guide/epd_disaggregation.md:73
+msgid ""
+"For the PD disaggregation part, refer to the limitations of PD "
+"decomposition"
+msgstr "对于 PD 解耦部分，请参考 PD 解耦的限制"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po
@@ -0,0 +1,247 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:1
+msgid "Expert Load Balance (EPLB)"
+msgstr "专家负载均衡 (EPLB)"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:5
+msgid ""
+"Expert balancing for MoE models in LLM serving is essential for optimal "
+"performance. Dynamically changing experts during inference can negatively"
+" impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due "
+"to stop-the-world operations. SwiftBalancer enables asynchronous expert "
+"load balancing with zero-overhead expert movement, ensuring seamless "
+"service continuity."
+msgstr ""
+"在LLM服务中，MoE模型的专家均衡对于实现最佳性能至关重要。推理过程中动态改变专家会因全局暂停操作而对TTFT（首词元时间）和TPOT（每输出词元时间）产生负面影响。SwiftBalancer支持异步专家负载均衡，实现零开销的专家迁移，确保服务无缝连续。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:7
+msgid "EPLB Effects"
+msgstr "EPLB效果"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:9
+msgid ""
+"Reduced Latency: Dynamically balances expert loads to minimize TTFT and "
+"TPOT by distributing workloads evenly across experts."
+msgstr "降低延迟：动态均衡专家负载，通过在各专家间均匀分配工作负载，最小化TTFT和TPOT。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:10
+msgid ""
+"Enhanced Throughput: Optimizes GPU utilization, increasing token "
+"generation speed under high-concurrency scenarios."
+msgstr "提升吞吐量：优化GPU利用率，在高并发场景下提高词元生成速度。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:11
+msgid ""
+"Zero-Overhead Movement: Expert redistribution occurs asynchronously "
+"without interrupting ongoing inference requests."
+msgstr "零开销迁移：专家重分布异步进行，不会中断正在进行的推理请求。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:12
+msgid ""
+"Adaptive Scaling: Automatically adjusts to workload fluctuations while "
+"maintaining stable performance."
+msgstr "自适应扩展：自动适应工作负载波动，同时保持性能稳定。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:13
+msgid ""
+"Fault Tolerance: Redundant expert placement ensures system resilience "
+"during hardware failures."
+msgstr "容错性：冗余的专家放置确保在硬件故障期间系统的韧性。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:15
+msgid "Support Scenarios"
+msgstr "支持场景"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:17
+msgid "Models"
+msgstr "模型"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:19
+msgid "DeepSeekV3/V3.1/R1, Qwen3-MoE"
+msgstr "DeepSeekV3/V3.1/R1, Qwen3-MoE"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:21
+msgid "MOE QuantType"
+msgstr "MOE量化类型"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:23
+msgid "W8A8-Dynamic"
+msgstr "W8A8-Dynamic"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:25
+msgid "How to Use EPLB"
+msgstr "如何使用EPLB"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:27
+msgid "Dynamic EPLB"
+msgstr "动态EPLB"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:29
+msgid ""
+"We need to add environment variable `export DYNAMIC_EPLB=\"true\"` to "
+"enable vLLM EPLB. Enable dynamic balancing with auto-tuned parameters. "
+"Adjust expert_heat_collection_interval and algorithm_execution_interval "
+"based on workload patterns."
+msgstr ""
+"我们需要添加环境变量 `export DYNAMIC_EPLB=\"true\"` 来启用vLLM EPLB。启用具有自动调优参数的动态均衡。根据工作负载模式调整 expert_heat_collection_interval 和 algorithm_execution_interval。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:42
+msgid "Static EPLB"
+msgstr "静态EPLB"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:44
+msgid "Initial Setup (Record Expert Map)"
+msgstr "初始设置（记录专家映射）"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:46
+msgid ""
+"We need to add environment variable `export EXPERT_MAP_RECORD=\"true\"` "
+"to record expert map. Generate the initial expert distribution map using "
+"expert_map_record_path. This creates a baseline configuration for future "
+"deployments."
+msgstr ""
+"我们需要添加环境变量 `export EXPERT_MAP_RECORD=\"true\"` 来记录专家映射。使用 expert_map_record_path 生成初始专家分布映射。这将为未来的部署创建一个基线配置。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:60
+msgid "Subsequent Deployments (Use Recorded Map)"
+msgstr "后续部署（使用记录的映射）"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:62
+msgid ""
+"Load the pre-recorded expert map for consistent performance. This avoids "
+"recalculating distributions at runtime."
+msgstr "加载预记录的专家映射以获得一致的性能。这避免了在运行时重新计算分布。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:73
+msgid "Critical Considerations"
+msgstr "关键注意事项"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:75
+msgid "Parameter Tuning:"
+msgstr "参数调优："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:76
+msgid ""
+"expert_heat_collection_interval: Higher values (e.g., 400+) for stable "
+"workloads; lower values (e.g., 100-200) for fluctuating traffic."
+msgstr "expert_heat_collection_interval：对于稳定的工作负载使用较高值（例如400+）；对于波动流量使用较低值（例如100-200）。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:77
+msgid ""
+"algorithm_execution_interval: Should be ≥ 30 to avoid premature balancing"
+" during startup."
+msgstr "algorithm_execution_interval：应≥30，以避免在启动期间过早进行均衡。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:78
+msgid ""
+"num_redundant_experts: Must match tensor-parallel size (e.g., 16 for 16 "
+"GPUs) to ensure sufficient redundancy."
+msgstr "num_redundant_experts：必须与张量并行大小匹配（例如，16个GPU对应16），以确保足够的冗余。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:80
+msgid "Hardware Requirements:"
+msgstr "硬件要求："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:81
+msgid ""
+"Ensure that all GPUs have identical memory capacity and compute "
+"capabilities."
+msgstr "确保所有GPU具有相同的内存容量和计算能力。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:82
+msgid ""
+"Network bandwidth must support expert redistribution traffic (≥ 10 Gbps "
+"recommended)."
+msgstr "网络带宽必须支持专家重分布流量（建议≥10 Gbps）。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:84
+msgid "Model Compatibility:"
+msgstr "模型兼容性："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:85
+msgid ""
+"Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE"
+" models) are compatible."
+msgstr "仅兼容明确支持专家并行的 MoE 模型（例如 Qwen3 MoE 模型）。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:86
+msgid ""
+"Verify model architecture supports dynamic expert routing through "
+"`--enable-expert-parallel`."
+msgstr "验证模型架构是否通过 `--enable-expert-parallel` 支持动态专家路由。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:88
+msgid "Monitoring & Validation:"
+msgstr "监控与验证："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:89
+msgid ""
+"Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and "
+"gpu_utilization."
+msgstr "跟踪指标：expert_load_balance_ratio, ttft_p99, tpot_avg 和 gpu_utilization。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:90
+msgid "Use vLLM monitor to detect imbalances during runtime."
+msgstr "使用vLLM监控器在运行时检测不均衡。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:91
+msgid ""
+"Always verify expert map JSON structure before loading (validate with jq "
+"or similar tools)."
+msgstr "在加载前始终验证专家映射的JSON结构（使用jq或类似工具验证）。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:93
+msgid "Startup Behavior:"
+msgstr "启动行为："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:94
+msgid ""
+"Initial requests may experience higher latency during the first balancing"
+" cycle (typically 1-2 minutes)."
+msgstr "初始请求在第一个均衡周期（通常1-2分钟）内可能会经历较高的延迟。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:95
+msgid "Avoid sudden traffic spikes during the warm-up phase."
+msgstr "避免在预热阶段出现突发的流量高峰。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:97
+msgid "Common Pitfalls:"
+msgstr "常见陷阱："
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:98
+msgid ""
+"Incorrect tensor-parallel-size vs. actual GPU count → causes resource "
+"underutilization."
+msgstr "张量并行大小与实际GPU数量不匹配 → 导致资源利用不足。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:99
+msgid "Using expert_map_path without generating the map first → runtime errors."
+msgstr "未先生成映射就使用 expert_map_path → 运行时错误。"
+
+#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:100
+msgid "Setting num_redundant_experts > available GPUs → system failure."
+msgstr "设置 num_redundant_experts > 可用GPU数量 → 系统故障。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po
@@ -0,0 +1,164 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:1
+msgid "External DP"
+msgstr "外部数据并行"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:3
+msgid ""
+"For larger-scale deployments especially, it can make sense to handle the "
+"orchestration and load balancing of data parallel ranks externally."
+msgstr "特别是在大规模部署场景下，在外部处理数据并行（DP）rank 的编排与负载均衡是合理的做法。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:5
+msgid ""
+"In this case, it's more convenient to treat each DP rank like a separate "
+"vLLM deployment, with its own endpoint, and have an external router "
+"balance HTTP requests between them, making use of appropriate real-time "
+"telemetry from each server for routing decisions."
+msgstr "在这种情况下，更方便的做法是将每个 DP rank 视为一个独立的 vLLM 部署（拥有自己的端点），并由外部路由器在它们之间均衡 HTTP 请求，同时利用每个服务器提供的实时遥测数据来做出路由决策。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:7
+msgid "Getting Start"
+msgstr "开始使用"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:9
+msgid ""
+"The functionality of [external "
+"DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external"
+"#external-load-balancing) is already natively supported by vLLM. In vllm-"
+"ascend we provide two enhanced functionalities:"
+msgstr "[外部数据并行](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) 功能已由 vLLM 原生支持。在 vllm-ascend 中，我们提供了两项增强功能："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:11
+msgid ""
+"A launch script that helps to launch multiple vLLM instances in one "
+"command."
+msgstr "一个启动脚本，用于通过一条命令启动多个 vLLM 实例。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:12
+msgid "A request-length-aware load-balance proxy for external DP."
+msgstr "一个支持外部数据并行、可感知请求长度的负载均衡代理。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:14
+msgid "This tutorial will introduce the usage of them."
+msgstr "本教程将介绍它们的使用方法。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:16
+msgid "Prerequisites"
+msgstr "先决条件"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:18
+msgid "Python 3.10+"
+msgstr "Python 3.10+"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:19
+msgid "Install dependencies needed by load-balance proxy server:"
+msgstr "安装负载均衡代理服务器所需的依赖项："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:25
+msgid "Starting External DP Servers"
+msgstr "启动外部数据并行服务器"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:27
+msgid ""
+"First, you need to have at least two vLLM servers running in data "
+"parallel. These can be mock servers or actual vLLM servers. Note that "
+"this proxy also works with only one vLLM server running, but will fall "
+"back to direct request forwarding which is meaningless."
+msgstr "首先，您需要至少运行两个处于数据并行模式的 vLLM 服务器。这些可以是模拟服务器或实际的 vLLM 服务器。请注意，此代理在仅运行一个 vLLM 服务器时也能工作，但会退化为直接请求转发，这没有意义。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:29
+msgid ""
+"You can start external vLLM DP servers one-by-one manually or using the "
+"launch script in `examples/external_online_dp`. For scenarios of large DP"
+" size across multiple nodes, we recommend using our launch script for "
+"convenience."
+msgstr "您可以手动逐个启动外部 vLLM 数据并行服务器，也可以使用 `examples/external_online_dp` 中的启动脚本。对于跨多个节点的大规模数据并行场景，我们建议使用我们的启动脚本以方便操作。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:31
+msgid "Manually Launch"
+msgstr "手动启动"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:39
+msgid "Use Launch Script"
+msgstr "使用启动脚本"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:41
+msgid ""
+"Firstly, you need to modify the "
+"`examples/external_online_dp/run_dp_template.sh` according to your vLLM "
+"configuration. Then you can use "
+"`examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM"
+" instances in one command on each node. It will internally call "
+"`examples/external_online_dp/run_dp_template.sh` for each DP rank with "
+"proper DP-related parameters."
+msgstr "首先，您需要根据您的 vLLM 配置修改 `examples/external_online_dp/run_dp_template.sh`。然后，您可以使用 `examples/external_online_dp/launch_online_dp.py` 在每个节点上通过一条命令启动多个 vLLM 实例。它将在内部为每个数据并行 rank 调用 `examples/external_online_dp/run_dp_template.sh`，并传入适当的数据并行相关参数。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:43
+msgid "An example of running external DP in one single node:"
+msgstr "在单个节点上运行外部数据并行的示例："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:51
+msgid "An example of running external DP in two nodes:"
+msgstr "在两个节点上运行外部数据并行的示例："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:66
+msgid "Starting Load-balance Proxy Server"
+msgstr "启动负载均衡代理服务器"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:68
+msgid ""
+"After all vLLM DP instances are launched, you can now launch the load-"
+"balance proxy server, which serves as an entrypoint for coming requests "
+"and load-balances them between vLLM DP instances."
+msgstr "所有 vLLM 数据并行实例启动后，您现在可以启动负载均衡代理服务器。该服务器作为传入请求的入口点，并在各个 vLLM 数据并行实例之间进行负载均衡。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:70
+msgid "The proxy server has the following features:"
+msgstr "该代理服务器具有以下特性："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:72
+msgid "Load balances requests to multiple vLLM servers based on request length."
+msgstr "基于请求长度，将请求负载均衡到多个 vLLM 服务器。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:73
+msgid ""
+"Supports OpenAI-compatible `/v1/completions` and `/v1/chat/completions` "
+"endpoints."
+msgstr "支持 OpenAI 兼容的 `/v1/completions` 和 `/v1/chat/completions` 端点。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:74
+msgid "Streams responses from backend servers to clients."
+msgstr "将来自后端服务器的响应流式传输给客户端。"
+
+#: ../../source/user_guide/feature_guide/external_dp.md:76
+msgid ""
+"To run the proxy server, you need to specify the host and port for each "
+"vLLM DP Instance:"
+msgstr "要运行代理服务器，您需要为每个 vLLM 数据并行实例指定主机和端口："
+
+#: ../../source/user_guide/feature_guide/external_dp.md:91
+msgid ""
+"After this, you can directly send requests to the proxy server and run DP"
+" with external load balancing."
+msgstr "此后，您可以直接向代理服务器发送请求，并运行具有外部负载均衡功能的数据并行。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/graph_mode.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/graph_mode.po
@@ -3,119 +3,120 @@
 # This file is distributed under the same license as the PROJECT project.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
 "Project-Id-Version: PROJECT VERSION\n"
 "Report-Msgid-Bugs-To: EMAIL@ADDRESS\n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language-Team: LANGUAGE <LL@li.org>\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/feature_guide/graph_mode.md:1
+#: ../../source/user_guide/feature_guide/graph_mode.md:1
 msgid "Graph Mode Guide"
 msgstr "图模式指南"

-#: ../../user_guide/feature_guide/graph_mode.md:4
+#: ../../source/user_guide/feature_guide/graph_mode.md:4
 msgid ""
 "This feature is currently experimental. In future versions, there may be "
-"behavioral changes around configuration, coverage, performance improvement."
+"behavioral changes around configuration, coverage, performance "
+"improvement."
 msgstr "此功能目前为实验性功能。在未来的版本中，配置、覆盖率和性能改进等方面的行为可能会有变化。"

-#: ../../user_guide/feature_guide/graph_mode.md:7
+#: ../../source/user_guide/feature_guide/graph_mode.md:8
+msgid ""
+"In context parallel scenario (i.e. prefill_context_parallel_size * "
+"decode_context_parallel_size > 1), \"cudagraph_mode\" is not sufficiently"
+" supported to be set to \"FULL\" yet."
+msgstr "在上下文并行场景下（即 prefill_context_parallel_size * decode_context_parallel_size > 1），目前对将 \"cudagraph_mode\" 设置为 \"FULL\" 的支持尚不完善。"
+
+#: ../../source/user_guide/feature_guide/graph_mode.md:11
 msgid ""
 "This guide provides instructions for using Ascend Graph Mode with vLLM "
-"Ascend. Please note that graph mode is only available on V1 Engine. And only"
-" Qwen, DeepSeek series models are well tested from 0.9.0rc1. We'll make it "
-"stable and generalize in the next release."
-msgstr ""
-"本指南提供了在 vLLM Ascend 上使用 Ascend 图模式的操作说明。请注意，图模式仅在 V1 引擎上可用，并且从 0.9.0rc1 起，仅对"
-" Qwen、DeepSeek 系列模型进行了充分测试。我们将在下一个版本中使其更加稳定和通用。"
+"Ascend. Please note that graph mode is only available on V1 Engine. And "
+"only Qwen, DeepSeek series models are well tested from 0.9.0rc1. We will "
+"make it stable and generalized in the next release."
+msgstr "本指南提供了在 vLLM Ascend 中使用昇腾图模式的操作说明。请注意，图模式仅在 V1 引擎上可用，并且从 0.9.0rc1 版本起，仅对 Qwen、DeepSeek 系列模型进行了充分测试。我们将在下一个版本中使其更加稳定和通用。"

-#: ../../user_guide/feature_guide/graph_mode.md:9
+#: ../../source/user_guide/feature_guide/graph_mode.md:13
 msgid "Getting Started"
 msgstr "快速入门"

-#: ../../user_guide/feature_guide/graph_mode.md:11
+#: ../../source/user_guide/feature_guide/graph_mode.md:15
 msgid ""
-"From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by "
-"default to keep the same behavior with vLLM. If you hit any issues, please "
-"feel free to open an issue on GitHub and fallback to eager mode temporarily "
-"by set `enforce_eager=True` when initializing the model."
-msgstr ""
-"从 v0.9.1rc1 版本起，使用 V1 引擎时，vLLM Ascend 默认将在图模式下运行模型，以保持与 vLLM "
-"同样的行为。如果遇到任何问题，欢迎在 GitHub 上提交 issue，并在初始化模型时通过设置 `enforce_eager=True` 临时切换回 "
-"eager 模式。"
+"From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode "
+"by default to keep the same behavior with vLLM. If you hit any issues, "
+"please feel free to open an issue on GitHub and fall back to the eager "
+"mode temporarily by setting `enforce_eager=True` when initializing the "
+"model."
+msgstr "从 v0.9.1rc1 版本起，在使用 V1 引擎时，vLLM Ascend 默认将在图模式下运行模型，以保持与 vLLM 一致的行为。如果遇到任何问题，欢迎在 GitHub 上提交 issue，并可在初始化模型时通过设置 `enforce_eager=True` 临时切换回 eager 模式。"

-#: ../../user_guide/feature_guide/graph_mode.md:13
-msgid "There are two kinds for graph mode supported by vLLM Ascend:"
+#: ../../source/user_guide/feature_guide/graph_mode.md:17
+msgid "There are two kinds of graph mode supported by vLLM Ascend:"
 msgstr "vLLM Ascend 支持两种图模式："

-#: ../../user_guide/feature_guide/graph_mode.md:14
+#: ../../source/user_guide/feature_guide/graph_mode.md:19
 msgid ""
-"**ACLGraph**: This is the default graph mode supported by vLLM Ascend. In "
-"v0.9.1rc1, only Qwen series models are well tested."
-msgstr ""
-"**ACLGraph**：这是 vLLM Ascend 支持的默认图模式。在 v0.9.1rc1 版本中，Qwen 和Deepseek系列模型得到了充分测试。"
+"**ACLGraph**: This is the default graph mode supported by vLLM Ascend. In"
+" v0.9.1rc1, Qwen and DeepSeek series models are well tested."
+msgstr "**ACLGraph**：这是 vLLM Ascend 支持的默认图模式。在 v0.9.1rc1 版本中，Qwen 和 DeepSeek 系列模型经过了充分测试。"

-#: ../../user_guide/feature_guide/graph_mode.md:15
+#: ../../source/user_guide/feature_guide/graph_mode.md:20
 msgid ""
-"**TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only DeepSeek "
-"series models are supported."
-msgstr "**TorchAirGraph**：这是GE图模式。在v0.9.1rc1版本中，仅支持DeepSeek系列模型。"
+"**XliteGraph**: This is the OpenEuler Xlite graph mode. In v0.11.0, only "
+"Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-VL are"
+" supported."
+msgstr "**XliteGraph**：这是 OpenEuler Xlite 图模式。在 v0.11.0 版本中，仅支持 Llama、Qwen 稠密系列模型、Qwen MoE 系列模型以及 Qwen3-VL。"

-#: ../../user_guide/feature_guide/graph_mode.md:17
+#: ../../source/user_guide/feature_guide/graph_mode.md:22
 msgid "Using ACLGraph"
 msgstr "使用 ACLGraph"

-#: ../../user_guide/feature_guide/graph_mode.md:18
+#: ../../source/user_guide/feature_guide/graph_mode.md:24
 msgid ""
-"ACLGraph is enabled by default. Take Qwen series models as an example, just "
-"set to use V1 Engine is enough."
+"ACLGraph is enabled by default. Take Qwen series models as an example, "
+"just set to use V1 Engine."
 msgstr "ACLGraph 默认启用。以 Qwen 系列模型为例，只需设置为使用 V1 引擎即可。"

-#: ../../user_guide/feature_guide/graph_mode.md:20
-#: ../../user_guide/feature_guide/graph_mode.md:41
-#: ../../user_guide/feature_guide/graph_mode.md:64
-msgid "offline example:"
+#: ../../source/user_guide/feature_guide/graph_mode.md:26
+#: ../../source/user_guide/feature_guide/graph_mode.md:51
+#: ../../source/user_guide/feature_guide/graph_mode.md:74
+msgid "Offline example:"
 msgstr "离线示例："

-#: ../../user_guide/feature_guide/graph_mode.md:31
-#: ../../user_guide/feature_guide/graph_mode.md:52
-#: ../../user_guide/feature_guide/graph_mode.md:74
-msgid "online example:"
+#: ../../source/user_guide/feature_guide/graph_mode.md:37
+#: ../../source/user_guide/feature_guide/graph_mode.md:62
+#: ../../source/user_guide/feature_guide/graph_mode.md:84
+msgid "Online example:"
 msgstr "在线示例："

-#: ../../user_guide/feature_guide/graph_mode.md:37
-msgid "Using TorchAirGraph"
-msgstr "使用 TorchAirGraph"
+#: ../../source/user_guide/feature_guide/graph_mode.md:43
+msgid "Using XliteGraph"
+msgstr "使用 XliteGraph"

-#: ../../user_guide/feature_guide/graph_mode.md:39
+#: ../../source/user_guide/feature_guide/graph_mode.md:45
 msgid ""
-"If you want to run DeepSeek series models with graph mode, you should use "
-"[TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html)."
-" In this case, additional config is required."
-msgstr ""
-"如果你想通过图模式运行 DeepSeek 系列模型，你应该使用 "
-"[TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html)。在这种情况下，需要额外的配置。"
+"If you want to run Llama, Qwen dense series models, Qwen MoE series "
+"models, or Qwen3-VL with Xlite graph mode, please install xlite, and set "
+"xlite_graph_config."
+msgstr "如果你想使用 Xlite 图模式运行 Llama、Qwen 稠密系列模型、Qwen MoE 系列模型或 Qwen3-VL，请安装 xlite 并设置 xlite_graph_config。"

-#: ../../user_guide/feature_guide/graph_mode.md:58
+#: ../../source/user_guide/feature_guide/graph_mode.md:68
 msgid ""
-"You can find more detail about additional config "
-"[here](../configuration/additional_config.md)."
-msgstr "你可以在[这里](../configuration/additional_config.md)找到关于附加配置的更多详细信息。"
+"You can find more details about "
+"[Xlite](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)"
+msgstr "你可以在 [Xlite](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md) 找到更多详细信息。"

-#: ../../user_guide/feature_guide/graph_mode.md:60
-msgid "Fallback to Eager Mode"
+#: ../../source/user_guide/feature_guide/graph_mode.md:70
+msgid "Fallback to the Eager Mode"
 msgstr "回退到 Eager 模式"

-#: ../../user_guide/feature_guide/graph_mode.md:62
+#: ../../source/user_guide/feature_guide/graph_mode.md:72
 msgid ""
-"If both `ACLGraph` and `TorchAirGraph` fail to run, you should fallback to "
-"eager mode."
-msgstr "如果 `ACLGraph` 和 `TorchAirGraph` 都无法运行，你应该退回到 eager 模式。"
+"If `ACLGraph` and `XliteGraph` all fail to run, you should fall back to "
+"the eager mode."
+msgstr "如果 `ACLGraph` 和 `XliteGraph` 都无法运行，你应该回退到 eager 模式。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po
@@ -0,0 +1,644 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:1
+msgid "Ascend Store Deployment Guide"
+msgstr "Ascend Store 部署指南"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:3
+msgid "Environmental Dependencies"
+msgstr "环境依赖"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:5
+#: ../../source/user_guide/feature_guide/kv_pool.md:35
+msgid "Software:"
+msgstr "软件："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:6
+msgid "CANN >= 8.5.0"
+msgstr "CANN >= 8.5.0"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:7
+msgid "vLLM：main branch"
+msgstr "vLLM：main 分支"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:8
+msgid "vLLM-Ascend：main branch"
+msgstr "vLLM-Ascend：main 分支"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:9
+msgid "mooncake：>= 0.3.9"
+msgstr "mooncake：>= 0.3.9"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:11
+msgid "KV Pool Parameter Description"
+msgstr "KV Pool 参数说明"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:13
+msgid ""
+"`kv_connector_extra_config`: Additional Configurable Parameters for "
+"Pooling"
+msgstr "`kv_connector_extra_config`: 池化的额外可配置参数"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Parameter"
+msgstr "参数"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Description"
+msgstr "描述"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`lookup_rpc_port`"
+msgstr "`lookup_rpc_port`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Port for RPC Communication Between Pooling Scheduler Process and Worker "
+"Process: Each Instance Requires a Unique Port Configuration."
+msgstr "池化调度进程与工作进程间 RPC 通信端口：每个实例需要配置唯一端口。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`load_async`"
+msgstr "`load_async`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Whether to Enable Asynchronous Loading. The default value is false."
+msgstr "是否启用异步加载。默认值为 false。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`backend`"
+msgstr "`backend`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Set the storage backend for kvpool, with the default being mooncake."
+msgstr "设置 kvpool 的存储后端，默认为 mooncake。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`consumer_is_to_put`"
+msgstr "`consumer_is_to_put`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Whether Decode node put KV Cache into KV Pool. The default value is false."
+msgstr "Decode 节点是否将 KV Cache 放入 KV Pool。默认值为 false。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`consumer_is_to_load`"
+msgstr "`consumer_is_to_load`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Whether Decode node load KV cache from KV Pool. The default value is "
+"false."
+msgstr "Decode 节点是否从 KV Pool 加载 KV cache。默认值为 false。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`prefill_pp_size`"
+msgstr "`prefill_pp_size`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Prefill PP size, needs to be set when Prefill node enables PP."
+msgstr "Prefill PP 大小，当 Prefill 节点启用 PP 时需要设置。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`prefill_pp_layer_partition`"
+msgstr "`prefill_pp_layer_partition`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Prefill PP layer partition, needs to be set when Prefill node enables PP."
+msgstr "Prefill PP 层划分，当 Prefill 节点启用 PP 时需要设置。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:25
+msgid "Environment Variable Configuration"
+msgstr "环境变量配置"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:27
+msgid ""
+"To guarantee uniform hash generation, it is required to synchronize the "
+"PYTHONHASHSEED environment variable across all nodes upon enabling KV "
+"Pool."
+msgstr "为保证哈希生成的一致性，启用 KV Pool 时，需要在所有节点上同步 PYTHONHASHSEED 环境变量。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:33
+msgid "Example of using Mooncake as a KV Pool backend"
+msgstr "使用 Mooncake 作为 KV Pool 后端的示例"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:35
+msgid "Software:"
+msgstr "软件："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:36
+msgid "Check NPU HCCN Configuration:"
+msgstr "检查 NPU HCCN 配置："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:38
+msgid ""
+"Ensure that the hccn.conf file exists in the environment. If using "
+"Docker, mount it into the container."
+msgstr "确保环境中存在 hccn.conf 文件。如果使用 Docker，请将其挂载到容器中。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:44
+msgid "Install Mooncake"
+msgstr "安装 Mooncake"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:46
+msgid ""
+"Mooncake is the serving platform for Kimi, a leading LLM service provided"
+" by Moonshot AI.   Installation and Compilation Guide: "
+"<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-"
+"binaries>.   First, we need to obtain the Mooncake project. Refer to the "
+"following command:"
+msgstr ""
+"Mooncake 是 Moonshot AI 提供的领先 LLM 服务 Kimi 的推理平台。   安装与编译指南："
+"<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-"
+"binaries>。   首先，我们需要获取 Mooncake 项目。参考以下命令："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:54
+msgid "(Optional) Replace go install url if the network is poor"
+msgstr "（可选）如果网络状况不佳，替换 go install 的 URL"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:61
+msgid "Install mpi"
+msgstr "安装 mpi"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:67
+msgid "Install the relevant dependencies. The installation of Go is not required."
+msgstr "安装相关依赖。无需安装 Go。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:73
+msgid "Compile and install"
+msgstr "编译并安装"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:83
+msgid "Set environment variables"
+msgstr "设置环境变量"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:85
+msgid "**Note:**"
+msgstr "**注意：**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:87
+msgid "Adjust the Python path according to your specific Python installation"
+msgstr "根据您具体的 Python 安装调整 Python 路径"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:88
+msgid ""
+"Ensure `/usr/local/lib` and `/usr/local/lib64` are in your "
+"`LD_LIBRARY_PATH`"
+msgstr "确保 `/usr/local/lib` 和 `/usr/local/lib64` 在您的 `LD_LIBRARY_PATH` 中"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:94
+msgid "Environment Variables Description"
+msgstr "环境变量说明"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Hardware"
+msgstr "硬件"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "HDK & CANN versions"
+msgstr "HDK 与 CANN 版本"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Export Command"
+msgstr "导出命令"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "800 I/T A3 series"
+msgstr "800 I/T A3 系列"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "HDK >= 26.0.0<br>CANN >= 9.0.0"
+msgstr "HDK >= 26.0.0<br>CANN >= 9.0.0"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`export ASCEND_ENABLE_USE_FABRIC_MEM=1`"
+msgstr "`export ASCEND_ENABLE_USE_FABRIC_MEM=1`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"**Recommended**. Enables unified memory address direct transmission "
+"scheme."
+msgstr "**推荐**。启用统一内存地址直传方案。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "25.5.0<=HDK<26.0.0"
+msgstr "25.5.0<=HDK<26.0.0"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`export ASCEND_BUFFER_POOL=4:8`"
+msgstr "`export ASCEND_BUFFER_POOL=4:8`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Configures the number and size of buffers on the NPU Device for "
+"aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB)."
+msgstr "配置 NPU 设备上用于聚合和 KV 传输的缓冲区数量和大小（例如，`4:8` 表示 4 个 8MB 的缓冲区）。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "800 I/T A2 series"
+msgstr "800 I/T A2 系列"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "N/A"
+msgstr "不适用"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`export HCCL_INTRA_ROCE_ENABLE=1`"
+msgstr "`export HCCL_INTRA_ROCE_ENABLE=1`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Required by direct transmission cheme on 800 I/T A2 series"
+msgstr "800 I/T A2 系列直传方案所需"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:102
+msgid "FAQ for HIXL (ascend_direct) backend"
+msgstr "HIXL (ascend_direct) 后端常见问题"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:104
+#, python-format
+msgid ""
+"For common troubleshooting and issue localization guidance for HIXL "
+"(ascend_direct), see: "
+"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
+msgstr ""
+"关于 HIXL (ascend_direct) 的常见故障排除和问题定位指南，请参阅："
+"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:107
+msgid "Run Mooncake Master"
+msgstr "运行 Mooncake Master"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:109
+msgid "1.Configure mooncake.json"
+msgstr "1. 配置 mooncake.json"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:111
+msgid ""
+"The environment variable **MOONCAKE_CONFIG_PATH** is configured to the "
+"full path where mooncake.json is located."
+msgstr "环境变量 **MOONCAKE_CONFIG_PATH** 配置为 mooncake.json 所在位置的完整路径。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:123
+msgid ""
+"**metadata_server**: Configured as **P2PHANDSHAKE**.   **protocol:** Must"
+" be set to 'Ascend' on the NPU. **device_name**: \"\" "
+"**master_server_address**: Configured with the IP and port of the master "
+"service.   **global_segment_size**: Registered memory size per card to "
+"the KV Pool. **Needs to be aligned to 1GB.**"
+msgstr ""
+"**metadata_server**: 配置为 **P2PHANDSHAKE**。   **protocol:** 在 NPU 上必须设置为 'Ascend'。"
+"**device_name**: \"\"   **master_server_address**: 配置 master 服务的 IP 和端口。   "
+"**global_segment_size**: 每张卡注册到 KV Pool 的内存大小。**需要对齐到 1GB。**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:129
+msgid "2.Start mooncake_master"
+msgstr "2. 启动 mooncake_master"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:131
+msgid "Under the mooncake folder:"
+msgstr "在 mooncake 文件夹下："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:137
+msgid ""
+"`eviction_high_watermark_ratio` determines the watermark where Mooncake "
+"Store will perform eviction，and `eviction_ratio` determines the portion "
+"of stored objects that would be evicted. `default_kv_lease_ttl` controls "
+"the default lease TTL for KV objects (milliseconds); configure it via "
+"`--default_kv_lease_ttl` and keep it larger than `ASCEND_CONNECT_TIMEOUT`"
+" and `ASCEND_TRANSFER_TIMEOUT`."
+msgstr ""
+"`eviction_high_watermark_ratio` 决定了 Mooncake Store 执行淘汰的水位线，`eviction_ratio` 决定了将被淘汰的存储对象比例。"
+"`default_kv_lease_ttl` 控制 KV 对象的默认租约 TTL（毫秒）；通过 `--default_kv_lease_ttl` 配置，并保持其大于 "
+"`ASCEND_CONNECT_TIMEOUT` 和 `ASCEND_TRANSFER_TIMEOUT`。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:140
+#: ../../source/user_guide/feature_guide/kv_pool.md:603
+msgid "PD Disaggregation Scenario"
+msgstr "PD 解耦场景"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:142
+#: ../../source/user_guide/feature_guide/kv_pool.md:605
+msgid "1.Run `prefill` Node and `decode` Node"
+msgstr "1. 运行 `prefill` 节点和 `decode` 节点"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:144
+msgid ""
+"Using `MultiConnector` to simultaneously utilize both "
+"`MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` "
+"performs kv_transfer, while `AscendStoreConnector` serves as the prefix-"
+"cache node."
+msgstr ""
+"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。"
+"`MooncakeConnectorV1` 执行 kv_transfer，而 `AscendStoreConnector` 作为 prefix-cache 节点。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:146
+#: ../../source/user_guide/feature_guide/kv_pool.md:611
+#: ../../source/user_guide/feature_guide/kv_pool.md:771
+msgid "`prefill` Node："
+msgstr "`prefill` 节点："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:152
+msgid "The content of the multi_producer.sh script:"
+msgstr "multi_producer.sh 脚本的内容："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:224
+#: ../../source/user_guide/feature_guide/kv_pool.md:690
+#: ../../source/user_guide/feature_guide/kv_pool.md:841
+msgid "`decode` Node："
+msgstr "`decode` 节点："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:230
+msgid "The content of multi_consumer.sh:"
+msgstr "multi_consumer.sh 的内容："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:292
+msgid ""
+"Currently, the key-value pool in PD Disaggregate only stores the kv cache"
+" generated by the Prefill node by default. In models using MLA, it is now"
+" supported that the Decode node stores the kv cache for use by the "
+"Prefill node, enabled by adding `consumer_is_to_put: true` to the "
+"AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` "
+"or `prefill_pp_layer_partition` also needs to be set. Example as follows:"
+msgstr ""
+"目前，PD 解耦中的键值池默认仅存储 Prefill 节点生成的 kv cache。在使用 MLA 的模型中，现已支持 Decode 节点存储 kv cache 供 "
+"Prefill 节点使用，通过在 AscendStoreConnector 中添加 `consumer_is_to_put: true` 来启用。如果 Prefill "
+"节点启用了 PP，则还需要设置 `prefill_pp_size` 或 `prefill_pp_layer_partition`。示例如下："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:308
+msgid "2、Start proxy_server"
+msgstr "2. 启动 proxy_server"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:319
+msgid "Change localhost to your actual IP address."
+msgstr "将 localhost 更改为您的实际 IP 地址。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:321
+msgid "3.Run Inference"
+msgstr "3. 运行推理"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:323
+msgid ""
+"Configure the localhost, port, and model weight path in the command to "
+"your own settings."
+msgstr "将命令中的 localhost、端口和模型权重路径配置为您自己的设置。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:325
+#: ../../source/user_guide/feature_guide/kv_pool.md:388
+msgid "Short question:"
+msgstr "短问题："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:331
+#: ../../source/user_guide/feature_guide/kv_pool.md:394
+msgid "Long question:"
+msgstr "长问题："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:337
+msgid "PD-Mixed Inference"
+msgstr "PD 混合推理"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:339
+#: ../../source/user_guide/feature_guide/kv_pool.md:916
+msgid "1.Run Mixed Department Script"
+msgstr "1. 运行混合部署脚本"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:345
+#: ../../source/user_guide/feature_guide/kv_pool.md:1056
+msgid "Content of pd_mix.sh:"
+msgstr "pd_mix.sh 内容："
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:384
+msgid "2.Run Inference"
+msgstr "2. 运行推理"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:386
+msgid ""
+"Configure the localhost, port, and model weight path in the command to "
+"your own settings. The requests sent will only go to the port where the "
+"mixed deployment script is located, and there is no need to start a "
+"separate proxy."
+msgstr "将命令中的 localhost、端口和模型权重路径配置为您自己的设置。发送的请求只会到达混合部署脚本所在的端口，无需启动单独的代理。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:400
+msgid ""
+"Note: For MooncakeStore with `ASCEND_BUFFER_POOL` enabled, it is "
+"recommended to perform a warm-up phase before running actual performance "
+"benchmarks."
+msgstr "注意：对于启用了 `ASCEND_BUFFER_POOL` 的 MooncakeStore，建议在运行实际性能基准测试之前先进行预热。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:402
+msgid ""
+"This is because HCCL one-sided communication connections are created "
+"lazily after the instance is launched when Device-to-Device communication"
+" is involved. Currently, full-mesh connections between all devices are "
+"required. Establishing these connections introduces a one-time time "
+"overhead and persistent device memory consumption (4 MB of device memory "
+"per connection)."
+msgstr "这是因为当涉及设备到设备通信时，HCCL 单边通信连接是在实例启动后延迟创建的。目前，需要在所有设备之间建立全互联（full-mesh）连接。建立这些连接会引入一次性时间开销和持续的设备内存消耗（每个连接消耗 4 MB 设备内存）。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:404
+msgid ""
+"**For warm-up, it is recommended to issue requests with an input sequence"
+" length of 8K and an output sequence length of 1, with the total number "
+"of requests being 2–3× the number of devices (cards/dies).**"
+msgstr "**对于预热，建议发送输入序列长度为 8K、输出序列长度为 1 的请求，请求总数为设备（卡/芯片）数量的 2-3 倍。**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:406
+msgid "Example of using Memcache as a KV Pool backend"
+msgstr "使用 Memcache 作为 KV Pool 后端的示例"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:408
+msgid "Installing Memcache"
+msgstr "安装 Memcache"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:410
+msgid ""
+"**MemCache depends on MemFabric. Therefore, MemFabric must be "
+"installed.Installing the memcache after the memfabric is installed.**"
+msgstr "**MemCache 依赖于 MemFabric。因此，必须先安装 MemFabric。在 memfabric 安装完成后，再安装 memcache。**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:412
+msgid ""
+"**memfabric_hybrid**: "
+"<https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
+msgstr "**memfabric_hybrid**: <https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:414
+msgid ""
+"**memcache**: "
+"<https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
+msgstr "**memcache**: <https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:416
+msgid "Configuring the memcache Config File"
+msgstr "配置 memcache 配置文件"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:419
+msgid ""
+"**Config file parameters "
+"description**：<https://gitcode.com/Ascend/memcache/blob/develop/doc/memcache_config.md>"
+msgstr "**配置文件参数说明**：<https://gitcode.com/Ascend/memcache/blob/develop/doc/memcache_config.md>"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:421
+msgid ""
+"Set TLS certificate configurations. If TLS is disabled, you do not need "
+"to upload a certificate. If TLS is enabled, you need to upload a "
+"certificate."
+msgstr "设置 TLS 证书配置。如果禁用 TLS，则无需上传证书。如果启用 TLS，则需要上传证书。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:434
+msgid ""
+"You are advised to copy mmc-local.conf and mmc-meta.conf to your own path"
+" and modify them, and set the MMC_META_CONFIG_PATH environment variable "
+"to the path of your own mmc-meta.conf file."
+msgstr "建议您将 mmc-local.conf 和 mmc-meta.conf 复制到您自己的路径并进行修改，并将 MMC_META_CONFIG_PATH 环境变量设置为您自己的 mmc-meta.conf 文件的路径。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:436
+msgid "**mmc-meta.conf：**"
+msgstr "**mmc-meta.conf：**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:485
+#: ../../source/user_guide/feature_guide/kv_pool.md:559
+msgid "**Key Focuses：**"
+msgstr "**关键要点：**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.meta_service_url`"
+msgstr "`ock.mmc.meta_service_url`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Configure the IP address and port number of the master node. The IP "
+"address and port number of the P node and D node can be the same."
+msgstr "配置主节点的 IP 地址和端口号。P 节点和 D 节点的 IP 地址和端口号可以相同。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.meta_service.config_store_url`"
+msgstr "`ock.mmc.meta_service.config_store_url`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.meta.ha.enable`"
+msgstr "`ock.mmc.meta.ha.enable`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "Set to `false` to disable TLS authentication modification."
+msgstr "设置为 `false` 以禁用 TLS 认证修改。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.config_store.tls.enable`"
+msgstr "`ock.mmc.config_store.tls.enable`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:494
+msgid "**mmc-local.conf：**"
+msgstr "**mmc-local.conf：**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.local_service.config_store_url`"
+msgstr "`ock.mmc.local_service.config_store_url`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.local_service.world_size`"
+msgstr "`ock.mmc.local_service.world_size`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Total count of local service, including services that will be added in "
+"the future."
+msgstr "本地服务的总数，包括未来将添加的服务。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.local_service.protocol`"
+msgstr "`ock.mmc.local_service.protocol`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"`host_rdma` (default), `device_rdma` (supported for A2 and A3 when device"
+" ROCE available, recommended for A2), `device_sdma` (supported for A3 "
+"when HCCS available, recommended for A3). Currently does not support "
+"heterogeneous protocol setting."
+msgstr "`host_rdma` (默认), `device_rdma` (A2 和 A3 在设备 ROCE 可用时支持，推荐用于 A2), `device_sdma` (A3 在 HCCS 可用时支持，推荐用于 A3)。目前不支持异构协议设置。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid "`ock.mmc.local_service.dram.size`"
+msgstr "`ock.mmc.local_service.dram.size`"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md
+msgid ""
+"Sets the size of the memory occupied by the master. The configured value "
+"is the size of the memory occupied by each card."
+msgstr "设置主节点占用的内存大小。配置的值为每张卡占用的内存大小。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:571
+msgid "Memcache environment variables"
+msgstr "Memcache 环境变量"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:580
+msgid "Run Memcache Master"
+msgstr "运行 Memcache 主节点"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:582
+msgid "Starting the MetaService service."
+msgstr "启动 MetaService 服务。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:593
+msgid "Method 2 for starting the MetaService service."
+msgstr "启动 MetaService 服务的方法 2。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:607
+msgid ""
+"Using `MultiConnector` to simultaneously utilize both "
+"`MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` "
+"performs kv_transfer, while `AscendStoreConnector` enables KV Cache Pool"
+msgstr "使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer，而 `AscendStoreConnector` 启用 KV 缓存池"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:609
+#: ../../source/user_guide/feature_guide/kv_pool.md:918
+msgid "800I A2/800T A2 Series"
+msgstr "800I A2/800T A2 系列"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:769
+#: ../../source/user_guide/feature_guide/kv_pool.md:1050
+msgid "800I A3/800T A3 Series"
+msgstr "800I A3/800T A3 系列"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:910
+msgid "[2、Start proxy_server](#2start-proxy_server)"
+msgstr "[2、启动 proxy_server](#2start-proxy_server)"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:912
+msgid "[3、run-inference](#3run-inference)"
+msgstr "[3、运行推理](#3run-inference)"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:914
+msgid "PD-Mixed Scenario"
+msgstr "PD 混合场景"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:920
+msgid "The deepseek model needs to be run in a two-node cluster."
+msgstr "deepseek 模型需要在双节点集群中运行。"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:922
+msgid "**Run_pd_mix_1.sh:**"
+msgstr "**Run_pd_mix_1.sh:**"
+
+#: ../../source/user_guide/feature_guide/kv_pool.md:985
+msgid "**Run_pd_mix_2.sh:**"
+msgstr "**Run_pd_mix_2.sh:**"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po
@@ -0,0 +1,477 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:1
+msgid "Distributed DP Server With Large-Scale Expert Parallelism"
+msgstr "支持大规模专家并行的分布式数据并行服务器"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:3
+msgid "Getting Start"
+msgstr "快速开始"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:5
+msgid ""
+"vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-"
+"scale **Expert Parallelism (EP)** scenario. To achieve better "
+"performance, the distributed DP server is applied in vLLM-Ascend. In the "
+"PD separation scenario, different optimization strategies can be "
+"implemented based on the distinct characteristics of PD nodes, thereby "
+"enabling more flexible model deployment. Taking the DeepSeek model as an "
+"example, using 8 Atlas 800T A3 servers to deploy the model. Assume the IP"
+" of the servers starts from 192.0.0.1 and ends by 192.0.0.8. Use the "
+"first 4 servers as prefiller nodes and the last 4 servers as decoder "
+"nodes. And the prefiller nodes are deployed as master nodes "
+"independently, while the decoder nodes use the 192.0.0.5 node as the "
+"master node."
+msgstr ""
+"vLLM-Ascend 现已支持在大规模**专家并行（EP）**场景下的预填充-解码（PD）解耦。为获得更好的性能，vLLM-Ascend 中应用了分布式数据并行服务器。在 PD 分离场景下，可以根据 PD 节点的不同特性实施不同的优化策略，从而实现更灵活的模型部署。以 DeepSeek 模型为例，使用 8 台 Atlas 800T A3 服务器部署模型。假设服务器 IP 从 192.0.0.1 开始到 192.0.0.8 结束。使用前 4 台服务器作为预填充节点，后 4 台服务器作为解码节点。并且预填充节点独立部署为主节点，而解码节点使用 192.0.0.5 节点作为主节点。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:8
+msgid "Verify Multi-Node Communication Environment"
+msgstr "验证多节点通信环境"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:10
+msgid "Physical Layer Requirements"
+msgstr "物理层要求"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:12
+msgid ""
+"The physical machines must be located on the same WLAN, with network "
+"connectivity."
+msgstr "物理机必须位于同一无线局域网内，并具备网络连通性。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:13
+msgid ""
+"All NPUs must be interconnected. For the Atlas A2 generation, intra-node "
+"connectivity is via HCCS, and inter-node connectivity is via RDMA. For "
+"the Atlas A3 generation, both intra-node and inter-node connectivity are "
+"via HCCS."
+msgstr ""
+"所有 NPU 必须互连。对于 Atlas A2 代，节点内连接通过 HCCS，节点间连接通过 RDMA。对于 Atlas A3 代，节点内和节点间连接均通过 HCCS。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:15
+msgid "Verification Process"
+msgstr "验证流程"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md
+msgid "A3"
+msgstr "A3"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:22
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:64
+msgid "Single Node Verification:"
+msgstr "单节点验证："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:24
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:66
+msgid ""
+"Execute the following commands on each node in sequence. The results must"
+" all be `success` and the status must be `UP`:"
+msgstr "依次在每个节点上执行以下命令。结果必须全部为 `success` 且状态必须为 `UP`："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:41
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:83
+msgid "Get NPU IP Addresses"
+msgstr "获取 NPU IP 地址"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:47
+msgid "Get superpodid and SDID"
+msgstr "获取 superpodid 和 SDID"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:53
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:89
+msgid "Cross-Node PING Test"
+msgstr "跨节点 PING 测试"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md
+msgid "A2"
+msgstr "A2"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:98
+msgid "Large-Scale EP model deployment"
+msgstr "大规模 EP 模型部署"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:100
+msgid "Generate script with configurations"
+msgstr "生成配置脚本"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:102
+msgid ""
+"In the PD separation scenario, we provide an optimized configuration. You"
+" can use the following shell script for configuring the prefiller and "
+"decoder nodes respectively."
+msgstr "在 PD 分离场景下，我们提供了优化配置。您可以使用以下 shell 脚本分别配置预填充节点和解码节点。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md
+msgid "Prefiller node"
+msgstr "预填充节点"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md
+msgid "Decoder node"
+msgstr "解码节点"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:241
+msgid "Start Distributed DP Server for prefill-decode disaggregation"
+msgstr "启动用于预填充-解码解耦的分布式数据并行服务器"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:243
+msgid ""
+"Execute the following Python file on all nodes to use the distributed DP "
+"server. (We recommend using this feature on the v0.9.1 official release)"
+msgstr "在所有节点上执行以下 Python 文件以使用分布式数据并行服务器。（我们建议在 v0.9.1 正式版本中使用此功能）"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:317
+msgid ""
+"Note that the prefiller nodes and the decoder nodes may have different "
+"configurations. In this example, each prefiller node is deployed as a "
+"master node independently, while the decoder nodes use the 192.0.0.5 node"
+" as the master node. This leads to differences in 'dp_size_local' and "
+"'dp_rank_start'"
+msgstr "请注意，预填充节点和解码节点可能具有不同的配置。在此示例中，每个预填充节点独立部署为主节点，而解码节点使用 192.0.0.5 节点作为主节点。这导致了 'dp_size_local' 和 'dp_rank_start' 的差异。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:319
+msgid "Example proxy for Distributed DP Server"
+msgstr "分布式数据并行服务器示例代理"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:321
+msgid ""
+"In the PD separation scenario, we need a proxy to distribute requests. "
+"Execute the following commands to enable the example proxy:"
+msgstr "在 PD 分离场景下，我们需要一个代理来分发请求。执行以下命令以启用示例代理："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Parameter"
+msgstr "参数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "meaning"
+msgstr "含义"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--port"
+msgstr "--port"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Proxy service Port"
+msgstr "代理服务端口"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--host"
+msgstr "--host"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Proxy service Host IP"
+msgstr "代理服务主机 IP"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--prefiller-hosts"
+msgstr "--prefiller-hosts"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Hosts of prefiller nodes"
+msgstr "预填充节点主机列表"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--prefiller-hosts-num"
+msgstr "--prefiller-hosts-num"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Number of repetitions for prefiller node hosts"
+msgstr "预填充节点主机重复次数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--prefiller-ports"
+msgstr "--prefiller-ports"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Ports of prefiller nodes"
+msgstr "预填充节点端口列表"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--prefiller-ports-inc"
+msgstr "--prefiller-ports-inc"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Number of increments for prefiller node ports"
+msgstr "预填充节点端口增量数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--decoder-hosts"
+msgstr "--decoder-hosts"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Hosts of decoder nodes"
+msgstr "解码节点主机列表"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--decoder-hosts-num"
+msgstr "--decoder-hosts-num"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Number of repetitions for decoder node hosts"
+msgstr "解码节点主机重复次数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--decoder-ports"
+msgstr "--decoder-ports"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Ports of decoder nodes"
+msgstr "解码节点端口列表"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "--decoder-ports-inc"
+msgstr "--decoder-ports-inc"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "Number of increments for decoder node ports"
+msgstr "解码节点端口增量数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:364
+msgid ""
+"You can get the proxy program in the repository's examples, "
+"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
+"project/vllm-"
+"ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
+msgstr "您可以在仓库的示例中找到代理程序，[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:366
+msgid "Benchmark"
+msgstr "基准测试"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:368
+msgid ""
+"We recommend using aisbench tool to assess performance. "
+"[aisbench](https://gitee.com/aisbench/benchmark). Execute the following "
+"commands to install aisbench"
+msgstr "我们推荐使用 aisbench 工具评估性能。[aisbench](https://gitee.com/aisbench/benchmark)。执行以下命令安装 aisbench"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:376
+msgid ""
+"You need to cancel the http proxy before assessing performance, as "
+"follows:"
+msgstr "在评估性能前，您需要取消 http 代理，如下所示："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:384
+msgid ""
+"You can place your datasets in the directory: "
+"`benchmark/ais_bench/datasets`"
+msgstr "您可以将数据集放置在目录：`benchmark/ais_bench/datasets` 中"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:385
+msgid ""
+"You can change the configuration in the directory "
+":`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take "
+"`vllm_api_stream_chat.py` as an example:"
+msgstr "您可以在目录：`benchmark/ais_bench/benchmark/configs/models/vllm_api` 中更改配置。以 `vllm_api_stream_chat.py` 为例："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:411
+msgid ""
+"Taking the gsm8k dataset as an example, execute the following commands to"
+" assess performance."
+msgstr "以 gsm8k 数据集为例，执行以下命令评估性能。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:417
+msgid ""
+"For more details on commands and parameters for aisbench, refer to "
+"[aisbench](https://gitee.com/aisbench/benchmark)"
+msgstr "有关 aisbench 命令和参数的更多详细信息，请参考 [aisbench](https://gitee.com/aisbench/benchmark)"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:419
+msgid "Prefill & Decode Configuration Details"
+msgstr "预填充与解码配置详情"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:421
+msgid "In the PD separation scenario, we provide an optimized configuration."
+msgstr "在 PD 分离场景下，我们提供了优化配置。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:423
+msgid "**prefiller node**"
+msgstr "**预填充节点**"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:425
+msgid "set HCCL_BUFFSIZE=256"
+msgstr "设置 HCCL_BUFFSIZE=256"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:426
+msgid "add '--enforce-eager' command to 'vllm serve'"
+msgstr "为 'vllm serve' 添加 '--enforce-eager' 参数"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:427
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:449
+msgid "Take '--kv-transfer-config' as follows:"
+msgstr "按如下方式设置 '--kv-transfer-config'："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:440
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:462
+msgid "Take '--additional-config' as follows:"
+msgstr "按如下方式设置 '--additional-config'："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:446
+msgid "**decoder node**"
+msgstr "**解码节点**"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:448
+msgid "set HCCL_BUFFSIZE=1024"
+msgstr "设置 HCCL_BUFFSIZE=1024"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:468
+msgid "Parameters Description"
+msgstr "参数说明"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:470
+msgid "'--additional-config' Parameter Introduction:"
+msgstr "'--additional-config' 参数介绍："
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:472
+msgid ""
+"**\"enable_weight_nz_layout\"**: Whether to convert quantized weights to "
+"NZ format to accelerate matrix multiplication."
+msgstr "**\"enable_weight_nz_layout\"**：是否将量化权重转换为 NZ 格式以加速矩阵乘法。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:473
+msgid ""
+"**\"enable_prefill_optimizations\"**: Whether to enable DeepSeek models' "
+"prefill optimizations. <br>"
+msgstr "**\"enable_prefill_optimizations\"**：是否启用 DeepSeek 模型的预填充优化。<br>"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:476
+msgid "Enable MTP Add the following command to your configurations."
+msgstr "启用 MTP：在您的配置中添加以下命令。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:483
+msgid "Recommended Configuration Example"
+msgstr "推荐配置示例"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:485
+msgid ""
+"For example, if the average input length is 3.5k, and the output length "
+"is 1.1k, the context length is 16k, the max length of the input dataset "
+"is 7K. In this scenario, we give a recommended configuration for "
+"distributed DP server with high EP. Here we use 4 nodes for prefill and 4"
+" nodes for decode."
+msgstr "例如，如果平均输入长度为 3.5k，输出长度为 1.1k，上下文长度为 16k，输入数据集的最大长度为 7K。在此场景下，我们为具有高 EP 的分布式数据并行服务器提供了一个推荐配置。这里我们使用 4 个节点进行预填充，4 个节点进行解码。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "node"
+msgstr "节点"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "DP"
+msgstr "数据并行"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "TP"
+msgstr "张量并行"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "EP"
+msgstr "专家并行"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "max-model-len"
+msgstr "max-model-len"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "max-num-batched-tokens"
+msgstr "max-num-batched-tokens"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "max-num-seqs"
+msgstr "max-num-seqs"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "gpu-memory-utilization"
+msgstr "gpu-memory-utilization"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "prefill"
+msgstr "预填充"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "2"
+msgstr "2"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "8"
+msgstr "8"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "16"
+msgstr "16"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "17000"
+msgstr "17000"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "16384"
+msgstr "16384"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "4"
+msgstr "4"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "0.9"
+msgstr "0.9"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "decode"
+msgstr "解码"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "64"
+msgstr "64"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "1"
+msgstr "1"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "256"
+msgstr "256"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
+msgid "28"
+msgstr "28"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:493
+msgid ""
+"Note that these configurations are not related to optimization. You need "
+"to adjust these parameters based on actual scenarios."
+msgstr "请注意，这些配置与优化无关。您需要根据实际场景调整这些参数。"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:496
+msgid "FAQ"
+msgstr "常见问题"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:498
+msgid "1. Prefiller nodes need to warm up"
+msgstr "1. 预填充节点需要预热"
+
+#: ../../source/user_guide/feature_guide/large_scale_ep.md:500
+msgid ""
+"Since the computation of some NPU operators requires several rounds of "
+"warm-up to achieve best performance, we recommend preheating the service "
+"with some requests before conducting performance tests to achieve the "
+"best end-to-end throughput."
+msgstr "由于部分 NPU 算子的计算需要经过数轮预热才能达到最佳性能，我们建议在进行性能测试前，先用一些请求预热服务，以达到最佳的端到端吞吐量。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po
@@ -0,0 +1,185 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:1
+msgid "Layer Sharding Linear Guide"
+msgstr "层分片线性算子指南"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:5
+msgid ""
+"**Layer Shard Linear** is a memory-optimization feature designed for "
+"large language model (LLM) inference. It addresses the high memory "
+"pressure caused by **repeated linear operators across many layers** that "
+"share identical structure but have distinct weights."
+msgstr ""
+"**层分片线性算子** 是一项为大语言模型推理设计的内存优化功能。它旨在解决由**跨越多层的重复线性算子**所引起的高内存压力，这些算子结构相同但权重不同。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:7
+msgid ""
+"Instead of replicating all weights on every device, **Layer Shard Linear "
+"shards the weights of a \"series\" of such operators across the NPU "
+"devices in a communication group**:"
+msgstr ""
+"与在每个设备上复制所有权重不同，**层分片线性算子将此类算子的一个\"系列\"的权重分片到通信组内的NPU设备上**："
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:9
+msgid ""
+"The **i-th layer's linear weight** is stored **only on device `i % K`**, "
+"where `K` is the number of devices in the group."
+msgstr ""
+"**第 i 层的线性权重** **仅存储在设备 `i % K` 上**，其中 `K` 是组内的设备数量。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:10
+msgid ""
+"Other devices hold a lightweight **shared dummy tensor** during "
+"initialization and fetch the real weight **on-demand** via asynchronous "
+"broadcast during the forward pass."
+msgstr ""
+"其他设备在初始化期间持有一个轻量级的**共享虚拟张量**，并在前向传播期间通过异步广播**按需**获取真实权重。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:12
+msgid ""
+"As illustrated in the figure below, this design enables broadcast to "
+"reach weights: while the current layer (e.g., MLA or MOE) is being "
+"computed, the system **asynchronously broadcasts the next layer's "
+"weight** in the background. Because the attention computation in the MLA "
+"module is sufficiently latency-bound, the weight transfer for `o_proj` is"
+" **fully overlapped with computation**, making the communication "
+"**latency-free from the perspective of end-to-end inference**."
+msgstr ""
+"如下图所示，这种设计使得权重可以通过广播按需获取：在当前层（例如 MLA 或 MOE）进行计算时，系统在后台**异步广播下一层的权重**。由于 MLA 模块中的注意力计算耗时足够长，`o_proj` 的权重传输**与计算完全重叠**，使得从端到端推理的角度看，通信**没有额外延迟**。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:14
+msgid ""
+"This approach **preserves exact computational semantics** while "
+"**significantly reducing NPU memory footprint**, especially critical for:"
+msgstr ""
+"这种方法**保持了精确的计算语义**，同时**显著减少了NPU内存占用**，这对于以下情况尤其关键："
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:16
+msgid "Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);"
+msgstr "极深的架构（例如，具有61层的DeepSeek-V3/R1）；"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:17
+msgid ""
+"Models using **[DSA-CP](https://github.com/vllm-project/vllm-"
+"ascend/pull/4702)** or **[FlashComm2](https://github.com/vllm-project"
+"/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix "
+"must reside in memory per layer;"
+msgstr ""
+"使用 **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** 或 **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)** 的模型，其每一层都必须在内存中保留完整的 `O`（输出）投影矩阵；"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:18
+msgid ""
+"Scenarios where **attention computation latency fully overlaps** (hides) "
+"the communication cost of weight broadcasting."
+msgstr "**注意力计算延迟完全覆盖（隐藏）**权重广播通信成本的场景。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:22
+msgid "Flowchart"
+msgstr "流程图"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:24
+msgid "![layer shard](./images/layer_sharding.png)"
+msgstr "![层分片](./images/layer_sharding.png)"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:24
+msgid "layer shard"
+msgstr "层分片"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:26
+msgid ""
+"**Figure.** Layer Shard Linear workflow: weights are sharded by layer "
+"across devices (top), and during forward execution (bottom), asynchronous"
+" broadcast **pre-fetches** the next layer's weight while the current "
+"layer computes—enabling **zero-overhead** weight loading."
+msgstr ""
+"**图.** 层分片线性算子工作流程：权重按层分片到各设备（顶部），在前向执行期间（底部），异步广播**预取**下一层的权重，同时当前层进行计算——实现**零开销**的权重加载。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:30
+msgid "Getting Started"
+msgstr "快速开始"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:32
+msgid ""
+"To enable **Layer Shard Linear**, specify the target linear layers using "
+"the `--additional-config` argument when launching your inference job. For"
+" example, to shard the `o_proj` and `q_b_proj` layers, use:"
+msgstr ""
+"要启用**层分片线性算子**，请在启动推理作业时使用 `--additional-config` 参数指定目标线性层。例如，要对 `o_proj` 和 `q_b_proj` 层进行分片，请使用："
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:40
+msgid ""
+"**Restriction** In PD-disaggregated deployments, Layer Sharding can only "
+"be enabled on the **P node** with `kv_role=\"kv_producer\"`. "
+"`kv_role=\"kv_consumer\"` and `kv_role=\"kv_both\"` are not supported."
+msgstr ""
+"**限制** 在PD解耦部署中，层分片只能在 `kv_role=\"kv_producer\"` 的 **P节点** 上启用。不支持 `kv_role=\"kv_consumer\"` 和 `kv_role=\"kv_both\"`。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:46
+msgid "Supported Scenarios"
+msgstr "支持场景"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:48
+msgid "This feature delivers the greatest benefit in the following cases:"
+msgstr "此功能在以下情况下能带来最大收益："
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:50
+msgid "FlashComm2-enabled"
+msgstr "启用FlashComm2"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:52
+msgid ""
+"When using [FlashComm2](https://github.com/vllm-project/vllm-"
+"ascend/pull/4188), the full output projection (`o_proj`) matrix must be "
+"resident in memory for each layer. Layer sharding significantly reduces "
+"memory pressure by distributing these weights across devices."
+msgstr ""
+"当使用 [FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188) 时，每一层都必须在内存中保留完整的输出投影（`o_proj`）矩阵。层分片通过将这些权重分布到各设备上，显著降低了内存压力。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:54
+#: ../../source/user_guide/feature_guide/layer_sharding.md:71
+msgid "**Example configuration:**"
+msgstr "**配置示例：**"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:65
+msgid "DSA-CP-enabled"
+msgstr "启用DSA-CP"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:67
+msgid ""
+"With [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702), "
+"both `q_b_proj` and `o_proj` layers require large weight matrices to be "
+"stored per layer. Sharding these layers across NPUs helps fit extremely "
+"deep models (e.g., 61-layer architectures) into limited device memory."
+msgstr ""
+"使用 [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702) 时，`q_b_proj` 和 `o_proj` 层都需要每层存储大型权重矩阵。将这些层分片到多个NPU上有助于将极深的模型（例如，61层架构）装入有限的设备内存中。"
+
+#: ../../source/user_guide/feature_guide/layer_sharding.md:69
+msgid ""
+"In PD-disaggregated deployments, this mode is supported only on the **P "
+"node** with `kv_role=\"kv_producer\"`."
+msgstr ""
+"在PD解耦部署中，此模式仅在 `kv_role=\"kv_producer\"` 的 **P节点** 上受支持。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lmcache_ascend_deployment.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lmcache_ascend_deployment.po
@@ -0,0 +1,100 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:1
+msgid "LMCache-Ascend Deployment Guide"
+msgstr "LMCache-Ascend 部署指南"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:5
+msgid ""
+"LMCache-Ascend is a community maintained plugin for running LMCache on "
+"the Ascend NPU."
+msgstr "LMCache-Ascend 是一个社区维护的插件，用于在昇腾 NPU 上运行 LMCache。"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:7
+msgid ""
+"We provide a simple deployment guide here. For further info about "
+"deployment notes, please refer to [LMCache-Ascend "
+"doc](https://github.com/LMCache/LMCache-Ascend/blob/main/README.md)"
+msgstr "本文提供一份简明的部署指南。关于部署的更多详细信息，请参阅 [LMCache-Ascend 文档](https://github.com/LMCache/LMCache-Ascend/blob/main/README.md)。"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:9
+msgid "Getting Started"
+msgstr "快速开始"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:11
+msgid "Clone LMCache-Ascend Repo"
+msgstr "克隆 LMCache-Ascend 仓库"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:13
+msgid ""
+"Our repo contains a kvcache ops submodule for ease of maintenance, "
+"therefore we recommend cloning the repo with submodules."
+msgstr "为便于维护，我们的仓库包含一个 kvcache 算子子模块，因此建议在克隆仓库时一并克隆子模块。"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:20
+msgid "Docker"
+msgstr "Docker"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:27
+msgid "Once that is built, run it with the following cmd"
+msgstr "构建完成后，使用以下命令运行"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:53
+msgid "Manual Installation"
+msgstr "手动安装"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:55
+msgid ""
+"Assuming your working directory is ```/workspace``` and vllm/vllm-ascend "
+"have already been installed."
+msgstr "假设您的工作目录是 ```/workspace``` 且 vllm/vllm-ascend 已安装。"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:57
+msgid "Install LMCache Repo"
+msgstr "安装 LMCache 仓库"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:63
+msgid "Install LMCache-Ascend Repo"
+msgstr "安装 LMCache-Ascend 仓库"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:70
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:72
+msgid ""
+"We introduce a dynamic KVConnector via LMCacheAscendConnectorV1Dynamic, "
+"therefore LMCache-Ascend Connector can be used via the kv transfer config"
+" in the two following setting."
+msgstr "我们通过 LMCacheAscendConnectorV1Dynamic 引入了一个动态 KVConnector，因此 LMCache-Ascend 连接器可以通过以下两种场景下的 kv 传输配置来使用。"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:74
+msgid "Online serving"
+msgstr "在线服务"
+
+#: ../../source/user_guide/feature_guide/lmcache_ascend_deployment.md:87
+msgid "Offline"
+msgstr "离线"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lora.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/lora.po
@@ -4,55 +4,102 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/feature_guide/lora.md:1
+#: ../../source/user_guide/feature_guide/lora.md:1
 msgid "LoRA Adapters Guide"
 msgstr "LoRA 适配器指南"

-#: ../../user_guide/feature_guide/lora.md:3
+#: ../../source/user_guide/feature_guide/lora.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/lora.md:5
 msgid ""
-"Like vLLM, vllm-ascend supports LoRA as well. The usage and more details can"
-" be found in [vLLM official "
+"Like vLLM, vllm-ascend supports LoRA as well. The usage and more details "
+"can be found in [vLLM official "
 "document](https://docs.vllm.ai/en/latest/features/lora.html)."
 msgstr ""
 "与 vLLM 类似，vllm-ascend 也支持 LoRA。用法及更多详情可参见 [vLLM "
 "官方文档](https://docs.vllm.ai/en/latest/features/lora.html)。"

-#: ../../user_guide/feature_guide/lora.md:5
+#: ../../source/user_guide/feature_guide/lora.md:7
 msgid ""
-"You can also refer to "
-"[this](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-"
-"text-only-language-models) to find which models support LoRA in vLLM."
+"You can refer to [Supported "
+"Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-"
+"of-text-only-language-models) to find which models support LoRA in vLLM."
 msgstr ""
-"你也可以参考[这个链接](https://docs.vllm.ai/en/latest/models/supported_models.html#list-"
-"of-text-only-language-models)来查找哪些模型在 vLLM 中支持 LoRA。"
+"你可以参考[支持的模型](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models)来查找 vLLM 中哪些模型支持 LoRA。"

-#: ../../user_guide/feature_guide/lora.md:7
-msgid "Tips"
-msgstr "提示"
-
-#: ../../user_guide/feature_guide/lora.md:8
+#: ../../source/user_guide/feature_guide/lora.md:9
 msgid ""
-"If you fail to run vllm-ascend with LoRA, you may follow [this "
-"instruction](https://vllm-"
-"ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html#fallback-"
-"to-eager-mode) to disable graph mode and try again."
-msgstr ""
-"如果你在使用 LoRA 运行 vllm-ascend 时失败，可以按照[此说明](https://vllm-"
-"ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html#fallback-"
-"to-eager-mode)禁用图模式后再重试。"
+"You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode "
+"Guide](./graph_mode.md) for a better LoRA performance."
+msgstr "你现在可以在 ACLGraph 模式下运行 LoRA。请参考[图模式指南](./graph_mode.md)以获得更好的 LoRA 性能。"
+
+#: ../../source/user_guide/feature_guide/lora.md:11
+msgid "Address for downloading models:"
+msgstr "模型下载地址："
+
+#: ../../source/user_guide/feature_guide/lora.md:13
+msgid ""
+"base model: <https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-"
+"hf/files>"
+msgstr "基础模型：<https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files>"
+
+#: ../../source/user_guide/feature_guide/lora.md:14
+msgid ""
+"lora model: <https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-"
+"lora-test/files>"
+msgstr "LoRA 模型：<https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files>"
+
+#: ../../source/user_guide/feature_guide/lora.md:16
+msgid "Example"
+msgstr "示例"
+
+#: ../../source/user_guide/feature_guide/lora.md:18
+msgid ""
+"We provide a simple LoRA example here, which enables the ACLGraph mode by"
+" default."
+msgstr "我们在此提供了一个简单的 LoRA 示例，该示例默认启用 ACLGraph 模式。"
+
+#: ../../source/user_guide/feature_guide/lora.md:26
+msgid "Custom LoRA Operators"
+msgstr "自定义 LoRA 算子"
+
+#: ../../source/user_guide/feature_guide/lora.md:28
+msgid ""
+"We have implemented LoRA-related AscendC operators, such as bgmv_shrink, "
+"bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "
+"\"csrc/kernels\" directory of [vllm-ascend repo](https://github.com/vllm-"
+"project/vllm-ascend.git)."
+msgstr "我们已经实现了与 LoRA 相关的 AscendC 算子，例如 bgmv_shrink、bgmv_expand、sgmv_shrink 和 sgmv_expand。你可以在 [vllm-ascend 代码库](https://github.com/vllm-project/vllm-ascend.git) 的 \"csrc/kernels\" 目录下找到它们。"
+
+#~ msgid "Tips"
+#~ msgstr "提示"
+
+#~ msgid ""
+#~ "If you fail to run vllm-ascend "
+#~ "with LoRA, you may follow [this "
+#~ "instruction](https://vllm-"
+#~ "ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html"
+#~ "#fallback-to-eager-mode) to disable "
+#~ "graph mode and try again."
+#~ msgstr ""
+#~ "如果你在使用 LoRA 运行 vllm-ascend "
+#~ "时失败，可以按照[此说明](https://vllm-"
+#~ "ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html"
+#~ "#fallback-to-eager-mode)禁用图模式后再重试。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po
@@ -0,0 +1,341 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/netloader.md:1
+msgid "Netloader Guide"
+msgstr "Netloader 指南"
+
+#: ../../source/user_guide/feature_guide/netloader.md:3
+msgid ""
+"This guide provides instructions for using **Netloader** as a weight-"
+"loader plugin for acceleration in **vLLM Ascend**."
+msgstr "本指南介绍如何将 **Netloader** 用作权重加载器插件，以在 **vLLM Ascend** 中实现加速。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:7
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/netloader.md:9
+msgid ""
+"Netloader leverages high-bandwidth peer-to-peer (P2P) transfers between "
+"NPU cards to load model weights. It is implemented as a plugin (via the "
+"`register_model_loader` API added in vLLM 0.10). The workflow is:"
+msgstr "Netloader 利用 NPU 卡之间的高带宽点对点 (P2P) 传输来加载模型权重。它通过插件实现（使用 vLLM 0.10 中添加的 `register_model_loader` API）。工作流程如下："
+
+#: ../../source/user_guide/feature_guide/netloader.md:11
+msgid "A **server** preloads a model."
+msgstr "**服务器** 预加载模型。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:12
+msgid "A new **client** instance requests weight transfer."
+msgstr "新的 **客户端** 实例请求权重传输。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:13
+msgid ""
+"After validating that the model and partitioning match, the client uses "
+"HCCL collective communication (send/recv) to receive weights in the same "
+"order as stored in the model."
+msgstr "在验证模型和分区匹配后，客户端使用 HCCL 集合通信 (send/recv) 按照模型中存储的相同顺序接收权重。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:15
+msgid ""
+"The server runs alongside normal inference tasks via sub-threads and via "
+"`stateless_init_torch_distributed_process_group` in vLLM. The client thus"
+" takes over weight initialization without needing to load from storage."
+msgstr "服务器通过子线程以及 vLLM 中的 `stateless_init_torch_distributed_process_group` 与常规推理任务并行运行。因此，客户端接管权重初始化，无需从存储加载。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:17
+msgid "Flowchart"
+msgstr "流程图"
+
+#: ../../source/user_guide/feature_guide/netloader.md:19
+msgid "![netloader flowchart](./images/netloader_flowchart.png)"
+msgstr "![Netloader 流程图](./images/netloader_flowchart.png)"
+
+#: ../../source/user_guide/feature_guide/netloader.md:19
+msgid "netloader flowchart"
+msgstr "Netloader 流程图"
+
+#: ../../source/user_guide/feature_guide/netloader.md:21
+msgid "Timing Diagram"
+msgstr "时序图"
+
+#: ../../source/user_guide/feature_guide/netloader.md:23
+msgid "![netloader timing diagram](./images/netloader_timing_diagram.png)"
+msgstr "![Netloader 时序图](./images/netloader_timing_diagram.png)"
+
+#: ../../source/user_guide/feature_guide/netloader.md:23
+msgid "netloader timing diagram"
+msgstr "Netloader 时序图"
+
+#: ../../source/user_guide/feature_guide/netloader.md:25
+msgid "Application Scenarios"
+msgstr "应用场景"
+
+#: ../../source/user_guide/feature_guide/netloader.md:27
+msgid ""
+"**Reduce startup latency**: By reusing already loaded weights and "
+"transferring them directly between NPU cards, Netloader cuts down model "
+"loading time versus conventional remote/local pull strategies."
+msgstr "**减少启动延迟**：通过重用已加载的权重并在 NPU 卡之间直接传输，Netloader 相比传统的远程/本地拉取策略，缩短了模型加载时间。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:28
+msgid ""
+"**Relieve network & storage load**: Avoid repeated downloads of weight "
+"files from remote repositories, thus reducing pressure on central storage"
+" and network traffic."
+msgstr "**减轻网络和存储负载**：避免从远程仓库重复下载权重文件，从而减轻中心存储和网络流量的压力。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:29
+msgid ""
+"**Improve resource utilization & lower cost**: Faster loading allows less"
+" reliance on standby compute nodes; resources can be scaled up/down more "
+"flexibly."
+msgstr "**提高资源利用率并降低成本**：更快的加载速度减少了对备用计算节点的依赖；资源可以更灵活地伸缩。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:30
+msgid ""
+"**Enhance business continuity & high availability**: In failure recovery,"
+" new instances can quickly take over without long downtime, improving "
+"system reliability and user experience."
+msgstr "**增强业务连续性和高可用性**：在故障恢复时，新实例可以快速接管而无需长时间停机，从而提高系统可靠性和用户体验。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:34
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/netloader.md:36
+msgid ""
+"To enable Netloader, pass `--load-format=netloader` and provide "
+"configuration via `--model-loader-extra-config` (as a JSON string). Below"
+" are the supported configuration fields:"
+msgstr "要启用 Netloader，请传递 `--load-format=netloader` 并通过 `--model-loader-extra-config`（作为 JSON 字符串）提供配置。以下是支持的配置字段："
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Field Name"
+msgstr "字段名"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Type"
+msgstr "类型"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Description"
+msgstr "描述"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Allowed Values / Notes"
+msgstr "允许值 / 备注"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**SOURCE**"
+msgstr "**SOURCE**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "List"
+msgstr "列表"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+#, python-brace-format
+msgid ""
+"Weight data sources. Each item is a map with `device_id` and `sources`, "
+"specifying the rank and its endpoints (IP:port). <br>Example: "
+"`{\"SOURCE\": [{\"device_id\": 0, \"sources\": "
+"[\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": "
+"[\"10.170.22.152:11228\"]}]}` <br>If omitted or empty, fallback to "
+"default loader. The SOURCE here is second priority."
+msgstr "权重数据源。每个条目是一个包含 `device_id` 和 `sources` 的映射，指定了 rank 及其端点 (IP:端口)。<br>示例：`{\"SOURCE\": [{\"device_id\": 0, \"sources\": [\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": [\"10.170.22.152:11228\"]}]}` <br>如果省略或为空，则回退到默认加载器。此处的 SOURCE 是第二优先级。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "A list of objects with keys `device_id: int` and `sources: List[str]`"
+msgstr "一个对象列表，其键为 `device_id: int` 和 `sources: List[str]`"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**MODEL**"
+msgstr "**MODEL**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "String"
+msgstr "字符串"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "The model name, used to verify consistency between client and server."
+msgstr "模型名称，用于验证客户端和服务器之间的一致性。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Defaults to the `--model` argument if not specified."
+msgstr "如果未指定，则默认为 `--model` 参数。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**LISTEN_PORT**"
+msgstr "**LISTEN_PORT**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Integer"
+msgstr "整数"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Base port for the server listener."
+msgstr "服务器监听器的基础端口。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid ""
+"The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port "
+"is chosen. Valid range: 1024–65535. If out of range, that server instance"
+" won’t open a listener."
+msgstr "实际端口 = `LISTEN_PORT + RANK`。如果省略，则选择一个随机有效端口。有效范围：1024–65535。如果超出范围，该服务器实例将不会打开监听器。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**INT8_CACHE**"
+msgstr "**INT8_CACHE**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Behavior for handling int8 parameters in quantized models."
+msgstr "处理量化模型中 int8 参数的行为。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid ""
+"One of `[\"hbm\", \"dram\", \"no\"]`. <br> - `hbm`: copy original int8 "
+"parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> -"
+" `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to "
+"divergence or unpredictable behavior). Default: `\"no\"`."
+msgstr "取值为 `[\"hbm\", \"dram\", \"no\"]` 之一。<br> - `hbm`：将原始 int8 参数复制到高带宽内存 (HBM)（可能消耗大量 HBM）。<br> - `dram`：复制到 DRAM。<br> - `no`：不进行特殊处理（可能导致结果偏差或不可预测的行为）。默认值：`\"no\"`。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**INT8_CACHE_NAME**"
+msgstr "**INT8_CACHE_NAME**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Names of parameters to which `INT8_CACHE` is applied (i.e. filtering)."
+msgstr "应用 `INT8_CACHE` 的参数名称（即过滤）。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Default: `None` (means no filtering—all parameters)."
+msgstr "默认值：`None`（表示不过滤——所有参数）。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**OUTPUT_PREFIX**"
+msgstr "**OUTPUT_PREFIX**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Prefix for writing per-rank listener address/port files in server mode."
+msgstr "在服务器模式下，用于写入每个 rank 监听器地址/端口文件的前缀。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+#, python-brace-format
+msgid ""
+"If set, each rank writes to `{OUTPUT_PREFIX}{RANK}.txt` (text), content ="
+" `IP:Port`."
+msgstr "如果设置，每个 rank 将写入 `{OUTPUT_PREFIX}{RANK}.txt`（文本文件），内容为 `IP:Port`。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "**CONFIG_FILE**"
+msgstr "**CONFIG_FILE**"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid "Path to a JSON file specifying the above configuration."
+msgstr "指定上述配置的 JSON 文件路径。"
+
+#: ../../source/user_guide/feature_guide/netloader.md
+msgid ""
+"If provided, the SOURCE inside this file has **first priority** "
+"(overrides SOURCE in other configs)."
+msgstr "如果提供，此文件内的 SOURCE 具有 **最高优先级**（覆盖其他配置中的 SOURCE）。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:50
+msgid "Example Commands & Placeholders"
+msgstr "示例命令与占位符"
+
+#: ../../source/user_guide/feature_guide/netloader.md:52
+msgid "Replace parts in `` `<...>` `` before running."
+msgstr "运行前替换 `` `<...>` `` 中的部分。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:54
+msgid "Server"
+msgstr "服务器"
+
+#: ../../source/user_guide/feature_guide/netloader.md:65
+msgid "Client"
+msgstr "客户端"
+
+#: ../../source/user_guide/feature_guide/netloader.md:80
+msgid "Placeholder Descriptions"
+msgstr "占位符说明"
+
+#: ../../source/user_guide/feature_guide/netloader.md:82
+msgid "`<model_file>`: Path to the model file"
+msgstr "`<model_file>`：模型文件路径"
+
+#: ../../source/user_guide/feature_guide/netloader.md:83
+msgid "`<model_name>`: Model name (must match between server & client)"
+msgstr "`<model_name>`：模型名称（服务器和客户端之间必须匹配）"
+
+#: ../../source/user_guide/feature_guide/netloader.md:84
+msgid "`<port>`: Base listening port on server"
+msgstr "`<port>`：服务器上的基础监听端口"
+
+#: ../../source/user_guide/feature_guide/netloader.md:85
+msgid ""
+"`<server_IP>` + `<server_Port>`: IP and port of the Netloader server "
+"(from server log)"
+msgstr "`<server_IP>` + `<server_Port>`：Netloader 服务器的 IP 和端口（来自服务器日志）"
+
+#: ../../source/user_guide/feature_guide/netloader.md:86
+msgid ""
+"`<device_id_diff_from_server>`: Client device ID (must differ from "
+"server’s)"
+msgstr "`<device_id_diff_from_server>`：客户端设备 ID（必须与服务器的不同）"
+
+#: ../../source/user_guide/feature_guide/netloader.md:87
+msgid "`<client_port>`: Port on which client listens"
+msgstr "`<client_port>`：客户端监听的端口"
+
+#: ../../source/user_guide/feature_guide/netloader.md:89
+msgid ""
+"After startup, you can test consistency by issuing inference requests "
+"with temperature = 0 and comparing outputs."
+msgstr "启动后，您可以通过发送 temperature = 0 的推理请求并比较输出来测试一致性。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:93
+msgid "Note & Caveats"
+msgstr "注意事项与限制"
+
+#: ../../source/user_guide/feature_guide/netloader.md:95
+msgid ""
+"If Netloader is used, **each worker process** must bind a listening port."
+" That port may be user-specified or assigned randomly. If user-specified,"
+" ensure it is available."
+msgstr "如果使用 Netloader，**每个工作进程** 都必须绑定一个监听端口。该端口可以是用户指定的，也可以是随机分配的。如果是用户指定的，请确保其可用。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:96
+msgid ""
+"Netloader requires extra HBM memory to establish HCCL connections (i.e. "
+"`HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient "
+"capacity (e.g. via `--gpu-memory-utilization`)."
+msgstr "Netloader 需要额外的 HBM 内存来建立 HCCL 连接（即 `HCCL_BUFFERSIZE`，默认约 200 MB）。用户应预留足够的容量（例如通过 `--gpu-memory-utilization`）。"
+
+#: ../../source/user_guide/feature_guide/netloader.md:97
+msgid ""
+"It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or"
+" slow connections/transmissions. Related info: [vLLM Issue "
+"#16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR "
+"#16226](https://github.com/vllm-project/vllm/pull/16226)."
+msgstr "建议设置 `VLLM_SLEEP_WHEN_IDLE=1` 以缓解不稳定或缓慢的连接/传输。相关信息：[vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660)、[vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226)。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po
@@ -0,0 +1,61 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:1
+msgid "Npugraph_ex"
+msgstr "Npugraph_ex"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:3
+msgid "Introduction"
+msgstr "简介"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:5
+msgid ""
+"As introduced in the [RFC](https://github.com/vllm-project/vllm-"
+"ascend/issues/4715), this is a simple ACLGraph graph mode acceleration "
+"solution based on Fx graphs."
+msgstr ""
+"如 [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715) 中所述，这是一个基于 Fx 图的简单 ACLGraph 图模式加速解决方案。"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:7
+msgid "Using npugraph_ex"
+msgstr "使用 npugraph_ex"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:9
+msgid ""
+"Npugraph_ex will be enabled by default in the future, Take Qwen series "
+"models as an example to show how to configure it."
+msgstr "Npugraph_ex 未来将默认启用。下面以 Qwen 系列模型为例，展示如何配置。"
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:11
+msgid "Offline example:"
+msgstr "离线示例："
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:28
+msgid "Online example:"
+msgstr "在线示例："
+
+#: ../../source/user_guide/feature_guide/npugraph_ex.md:35
+msgid ""
+"You can find more details about "
+"[npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)"
+msgstr ""
+"您可以在 [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html) 找到更多详细信息。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/quantization.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/quantization.po
@@ -4,180 +4,201 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/feature_guide/quantization.md:1
+#: ../../source/user_guide/feature_guide/quantization.md:1
 msgid "Quantization Guide"
 msgstr "量化指南"

-#: ../../user_guide/feature_guide/quantization.md:3
+#: ../../source/user_guide/feature_guide/quantization.md:3
 msgid ""
-"Model quantization is a technique that reduces the size and computational "
-"requirements of a model by lowering the data precision of the weights and "
-"activation values in the model, thereby saving the memory and improving the "
-"inference speed."
+"Model quantization is a technique that reduces model size and "
+"computational overhead by lowering the numerical precision of weights and"
+" activations, thereby saving memory and improving inference speed."
 msgstr "模型量化是一种通过降低模型中权重和激活值的数据精度，从而减少模型大小和计算需求的技术，这样可以节省内存并提高推理速度。"

-#: ../../user_guide/feature_guide/quantization.md:5
+#: ../../source/user_guide/feature_guide/quantization.md:5
 msgid ""
-"Since 0.9.0rc2 version, quantization feature is experimentally supported in "
-"vLLM Ascend. Users can enable quantization feature by specifying "
-"`--quantization ascend`. Currently, only Qwen, DeepSeek series models are "
-"well tested. We’ll support more quantization algorithm and models in the "
-"future."
-msgstr ""
-"自 0.9.0rc2 版本起，vLLM Ascend 实验性地支持量化特性。用户可以通过指定 `--quantization ascend` "
-"启用量化功能。目前，只有 Qwen、DeepSeek 系列模型经过了充分测试。未来我们将支持更多的量化算法和模型。"
+"`vLLM Ascend` supports multiple quantization methods. This guide provides"
+" instructions for using different quantization tools and running "
+"quantized models on vLLM Ascend."
+msgstr "`vLLM Ascend` 支持多种量化方法。本指南提供了使用不同量化工具以及在 vLLM Ascend 上运行量化模型的说明。"

-#: ../../user_guide/feature_guide/quantization.md:7
-msgid "Install modelslim"
-msgstr "安装 modelslim"
+#: ../../source/user_guide/feature_guide/quantization.md:7
+msgid "**Note**"
+msgstr "**注意**"

-#: ../../user_guide/feature_guide/quantization.md:9
+#: ../../source/user_guide/feature_guide/quantization.md:9
+msgid ""
+"You can choose to convert the model yourself or use the quantized model "
+"we uploaded. See <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2"
+"-Instruct-W8A8>. Before you quantize a model, ensure sufficient RAM is "
+"available."
+msgstr "您可以选择自行转换模型，或使用我们上传的量化模型。请参阅 <https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8>。在对模型进行量化之前，请确保有足够的可用内存。"
+
+#: ../../source/user_guide/feature_guide/quantization.md:13
+msgid "Quantization Tools"
+msgstr "量化工具"
+
+#: ../../source/user_guide/feature_guide/quantization.md:15
+msgid ""
+"vLLM Ascend supports models quantized by two main tools: `ModelSlim` and "
+"`LLM-Compressor`."
+msgstr "vLLM Ascend 支持由两种主要工具量化的模型：`ModelSlim` 和 `LLM-Compressor`。"
+
+#: ../../source/user_guide/feature_guide/quantization.md:17
+msgid "1. ModelSlim (Recommended)"
+msgstr "1. ModelSlim (推荐)"
+
+#: ../../source/user_guide/feature_guide/quantization.md:19
 msgid ""
-"To quantize a model, users should install "
 "[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)"
-" which is the Ascend compression and acceleration tool. It is an affinity-"
-"based compression tool designed for acceleration, using compression as its "
-"core technology and built upon the Ascend platform."
-msgstr ""
-"要对模型进行量化，用户应安装[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)，这是昇腾的压缩与加速工具。它是一种基于亲和性的压缩工具，专为加速设计，以压缩为核心技术，并基于昇腾平台构建。"
+" is an Ascend-friendly compression tool focused on acceleration, using "
+"compression techniques, and built for Ascend hardware. It includes a "
+"series of inference optimization technologies such as quantization and "
+"compression, aiming to accelerate large language dense models, MoE "
+"models, multimodal understanding models, multimodal generation models, "
+"etc."
+msgstr "[ModelSlim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md) 是一款昇腾亲和的压缩工具，专注于加速，以压缩技术为核心，面向昇腾硬件构建。它包含量化、压缩等一系列推理优化技术，旨在加速大语言稠密模型、MoE 模型、多模态理解模型、多模态生成模型等。"

-#: ../../user_guide/feature_guide/quantization.md:11
+#: ../../source/user_guide/feature_guide/quantization.md:21
+#: ../../source/user_guide/feature_guide/quantization.md:67
+msgid "Installation"
+msgstr "安装"
+
+#: ../../source/user_guide/feature_guide/quantization.md:23
 msgid ""
-"Currently, only the specific tag [modelslim-"
-"VLLM-8.1.RC1.b020_001](https://gitcode.com/Ascend/msit/blob/modelslim-"
-"VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM "
-"Ascend. Please do not install other version until modelslim master version "
-"is available for vLLM Ascend in the future."
-msgstr ""
-"目前，只有 modelslim 的特定标签 [modelslim-"
-"VLLM-8.1.RC1.b020_001](https://gitcode.com/Ascend/msit/blob/modelslim-"
-"VLLM-8.1.RC1.b020_001/msmodelslim/README.md) 支持 vLLM Ascend。在未来 modelslim "
-"的主版本支持 vLLM Ascend 之前，请不要安装其他版本。"
+"To use ModelSlim for model quantization, install it from its [Git "
+"repository](https://gitcode.com/Ascend/msit):"
+msgstr "要使用 ModelSlim 进行模型量化，请从其 [Git 仓库](https://gitcode.com/Ascend/msit) 安装："

-#: ../../user_guide/feature_guide/quantization.md:13
-msgid "Install modelslim:"
-msgstr "安装 modelslim："
+#: ../../source/user_guide/feature_guide/quantization.md:34
+#: ../../source/user_guide/feature_guide/quantization.md:73
+msgid "Model Quantization"
+msgstr "模型量化"

-#: ../../user_guide/feature_guide/quantization.md:21
-msgid "Quantize model"
-msgstr "量化模型"
-
-#: ../../user_guide/feature_guide/quantization.md:23
-#, python-format
+#: ../../source/user_guide/feature_guide/quantization.md:36
 msgid ""
-"Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-"
-"ai/DeepSeek-V2-Lite) as an example, you just need to download the model, and"
-" then execute the convert command. The command is shown below. More info can"
-" be found in modelslim doc [deepseek w8a8 dynamic quantization "
-"docs](https://gitcode.com/Ascend/msit/blob/modelslim-"
-"VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96)."
-msgstr ""
-"以 [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-"
-"ai/DeepSeek-V2-Lite) 为例，你只需要下载模型，然后执行转换命令。命令如下所示。更多信息可参考 modelslim 文档 "
-"[deepseek w8a8 动态量化文档](https://gitcode.com/Ascend/msit/blob/modelslim-"
-"VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96)。"
+"The following example shows how to generate W8A8 quantized weights for "
+"the [Qwen3-MoE "
+"model](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md)."
+msgstr "以下示例展示了如何为 [Qwen3-MoE 模型](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/Qwen3-MOE/README.md) 生成 W8A8 量化权重。"

-#: ../../user_guide/feature_guide/quantization.md:32
+#: ../../source/user_guide/feature_guide/quantization.md:38
+msgid "**Quantization Script:**"
+msgstr "**量化脚本：**"
+
+#: ../../source/user_guide/feature_guide/quantization.md:59
 msgid ""
-"You can also download the quantized model that we uploaded. Please note that"
-" these weights should be used for test only. For example, "
-"https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8"
-msgstr ""
-"你也可以下载我们上传的量化模型。请注意，这些权重仅应用于测试。例如：https://www.modelscope.cn/models/vllm-"
-"ascend/DeepSeek-V2-Lite-W8A8"
+"After quantization completes, the output directory will contain the "
+"quantized model files."
+msgstr "量化完成后，输出目录将包含量化后的模型文件。"

-#: ../../user_guide/feature_guide/quantization.md:35
-msgid "Once convert action is done, there are two important files generated."
-msgstr "转换操作完成后，会生成两个重要的文件。"
-
-#: ../../user_guide/feature_guide/quantization.md:37
+#: ../../source/user_guide/feature_guide/quantization.md:61
 msgid ""
-"[config.json](https://www.modelscope.cn/models/vllm-"
-"ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please "
-"make sure that there is no `quantization_config` field in it."
-msgstr ""
-"[config.json](https://www.modelscope.cn/models/vllm-"
-"ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1)。请确保其中没有 "
-"`quantization_config` 字段。"
+"For more examples, refer to the [official "
+"examples](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example)."
+msgstr "更多示例，请参考 [官方示例](https://gitcode.com/Ascend/msit/tree/master/msmodelslim/example)。"

-#: ../../user_guide/feature_guide/quantization.md:39
+#: ../../source/user_guide/feature_guide/quantization.md:63
+msgid "2. LLM-Compressor"
+msgstr "2. LLM-Compressor"
+
+#: ../../source/user_guide/feature_guide/quantization.md:65
 msgid ""
-"[quant_model_description.json](https://www.modelscope.cn/models/vllm-"
-"ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1)."
-" All the converted weights info are recorded in this file."
-msgstr ""
-"[quant_model_description.json](https://www.modelscope.cn/models/vllm-"
-"ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1)。所有被转换的权重信息都记录在该文件中。"
+"[LLM-Compressor](https://github.com/vllm-project/llm-compressor) is a "
+"unified compressed model library for faster vLLM inference."
+msgstr "[LLM-Compressor](https://github.com/vllm-project/llm-compressor) 是一个统一的压缩模型库，用于加速 vLLM 推理。"

-#: ../../user_guide/feature_guide/quantization.md:41
-msgid "Here is the full converted model files:"
-msgstr "以下是完整转换后的模型文件："
+#: ../../source/user_guide/feature_guide/quantization.md:75
+msgid "`LLM-Compressor` provides various quantization scheme examples."
+msgstr "`LLM-Compressor` 提供了多种量化方案的示例。"

-#: ../../user_guide/feature_guide/quantization.md:60
-msgid "Run the model"
-msgstr "运行模型"
+#: ../../source/user_guide/feature_guide/quantization.md:77
+msgid "Dense Quantization"
+msgstr "密集模型量化"

-#: ../../user_guide/feature_guide/quantization.md:62
+#: ../../source/user_guide/feature_guide/quantization.md:79
+msgid "An example to generate W8A8 dynamic quantized weights for dense model:"
+msgstr "为密集模型生成 W8A8 动态量化权重的示例："
+
+#: ../../source/user_guide/feature_guide/quantization.md:89
+msgid "MoE Quantization"
+msgstr "MoE 模型量化"
+
+#: ../../source/user_guide/feature_guide/quantization.md:91
+msgid "An example to generate W8A8 dynamic quantized weights for MoE model:"
+msgstr "为 MoE 模型生成 W8A8 动态量化权重的示例："
+
+#: ../../source/user_guide/feature_guide/quantization.md:101
 msgid ""
-"Now, you can run the quantized models with vLLM Ascend. Here is the example "
-"for online and offline inference."
-msgstr "现在，你可以使用 vLLM Ascend 运行量化模型。下面是在线和离线推理的示例。"
+"For more content, refer to the [official examples](https://github.com"
+"/vllm-project/llm-compressor/tree/main/examples)."
+msgstr "更多内容，请参考 [官方示例](https://github.com/vllm-project/llm-compressor/tree/main/examples)。"

-#: ../../user_guide/feature_guide/quantization.md:64
-msgid "Offline inference"
+#: ../../source/user_guide/feature_guide/quantization.md:103
+msgid ""
+"Currently supported quantization types by LLM-Compressor: `W8A8` and "
+"`W8A8_DYNAMIC`."
+msgstr "LLM-Compressor 当前支持的量化类型：`W8A8` 和 `W8A8_DYNAMIC`。"
+
+#: ../../source/user_guide/feature_guide/quantization.md:105
+msgid "Running Quantized Models"
+msgstr "运行量化模型"
+
+#: ../../source/user_guide/feature_guide/quantization.md:107
+msgid ""
+"Once you have a quantized model which is generated by **ModelSlim**, you "
+"can use vLLM Ascend for inference by specifying the `--quantization "
+"ascend` parameter to enable quantization features, while for models "
+"quantized by **LLM-Compressor**, do not need to add this parameter."
+msgstr "一旦您拥有由 **ModelSlim** 生成的量化模型，您可以通过指定 `--quantization ascend` 参数来使用 vLLM Ascend 进行推理以启用量化功能。而对于由 **LLM-Compressor** 量化的模型，则无需添加此参数。"
+
+#: ../../source/user_guide/feature_guide/quantization.md:109
+msgid "Offline Inference"
 msgstr "离线推理"

-#: ../../user_guide/feature_guide/quantization.md:90
-msgid "Online inference"
+#: ../../source/user_guide/feature_guide/quantization.md:143
+msgid "Online Inference"
 msgstr "在线推理"

-#: ../../user_guide/feature_guide/quantization.md:97
-msgid "FAQs"
-msgstr "常见问题解答"
+#: ../../source/user_guide/feature_guide/quantization.md:158
+msgid "References"
+msgstr "参考"

-#: ../../user_guide/feature_guide/quantization.md:99
+#: ../../source/user_guide/feature_guide/quantization.md:160
 msgid ""
-"1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' "
-"problem?"
-msgstr "1. 如何解决 KeyError: 'xxx.layers.0.self_attn.q_proj.weight' 问题？"
+"[ModelSlim "
+"Documentation](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)"
+msgstr "[ModelSlim 文档](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/README.md)"

-#: ../../user_guide/feature_guide/quantization.md:101
-msgid ""
-"First, make sure you specify `ascend` quantization method. Second, check if "
-"your model is converted by this `modelslim-VLLM-8.1.RC1.b020_001` modelslim "
-"version. Finally, if it still doesn't work, please submit a issue, maybe "
-"some new models need to be adapted."
-msgstr ""
-"首先，请确保你指定了 `ascend` 量化方法。其次，检查你的模型是否由 `modelslim-VLLM-8.1.RC1.b020_001` 这个 "
-"modelslim 版本转换。如果仍然无法使用，请提交一个 issue，可能有一些新模型需要适配。"
+#: ../../source/user_guide/feature_guide/quantization.md:161
+msgid "[LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)"
+msgstr "[LLM-Compressor GitHub](https://github.com/vllm-project/llm-compressor)"

-#: ../../user_guide/feature_guide/quantization.md:104
-msgid ""
-"2. How to solve the error \"Could not locate the "
-"configuration_deepseek.py\"?"
-msgstr "2. 如何解决“无法找到 configuration_deepseek.py”错误？"
+#: ../../source/user_guide/feature_guide/quantization.md:162
+msgid "[vLLM Quantization Guide](https://docs.vllm.ai/en/latest/quantization/)"
+msgstr "[vLLM 量化指南](https://docs.vllm.ai/en/latest/quantization/)"

-#: ../../user_guide/feature_guide/quantization.md:106
-msgid ""
-"Please convert DeepSeek series models using `modelslim-"
-"VLLM-8.1.RC1.b020_001` modelslim, this version has fixed the missing "
-"configuration_deepseek.py error."
-msgstr ""
-"请使用 `modelslim-VLLM-8.1.RC1.b020_001` 的 modelslim 转换 DeepSeek 系列模型，该版本已修复缺少 "
-"configuration_deepseek.py 的错误。"
+#~ msgid ""
+#~ "Please convert DeepSeek series models "
+#~ "using `modelslim-VLLM-8.1.RC1.b020_001` modelslim,"
+#~ " this version has fixed the missing"
+#~ " configuration_deepseek.py error."
+#~ msgstr ""
+#~ "请使用 `modelslim-VLLM-8.1.RC1.b020_001` 版本的 "
+#~ "modelslim 转换 DeepSeek 系列模型，该版本已修复缺少 "
+#~ "configuration_deepseek.py 文件的错误。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/rfork.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/rfork.po
@@ -0,0 +1,386 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/rfork.md:1
+msgid "RFork Guide"
+msgstr "RFork 指南"
+
+#: ../../source/user_guide/feature_guide/rfork.md:3
+msgid ""
+"This guide explains how to use **RFork** as a model-loader plugin in "
+"**vLLM Ascend**."
+msgstr "本指南介绍如何在 **vLLM Ascend** 中使用 **RFork** 作为模型加载器插件。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:7
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/rfork.md:9
+msgid ""
+"RFork is a warm-start weight loading path for vLLM Ascend. Instead of "
+"always reading model weights from storage, a new instance can request a "
+"compatible **seed** instance from an external planner, then pull weights "
+"directly from that seed through `YuanRong TransferEngine`."
+msgstr "RFork 是 vLLM Ascend 的一种热启动权重加载路径。新实例无需总是从存储中读取模型权重，而是可以从外部规划器请求一个兼容的 **种子** 实例，然后通过 `YuanRong TransferEngine` 直接从该种子拉取权重。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:11
+msgid "The RFork loading flow in the current implementation is:"
+msgstr "当前实现中，RFork 的加载流程如下："
+
+#: ../../source/user_guide/feature_guide/rfork.md:13
+msgid "vLLM starts with `--load-format rfork`."
+msgstr "vLLM 以 `--load-format rfork` 参数启动。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:14
+msgid ""
+"RFork builds a **seed key** from the model identity and deployment "
+"topology."
+msgstr "RFork 根据模型标识和部署拓扑构建一个 **种子键**。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:15
+msgid "RFork asks the planner for an available seed matching that key."
+msgstr "RFork 向规划器请求一个与该键匹配的可用种子。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:16
+msgid ""
+"If a seed is returned, the new instance initializes the model structure "
+"on its local NPU, registers local weight memory, fetches the remote "
+"transfer-engine metadata from the seed, and performs batch weight "
+"transfer into local parameter buffers."
+msgstr "如果返回了一个种子，新实例将在其本地 NPU 上初始化模型结构，注册本地权重内存，从种子获取远程传输引擎的元数据，并执行批量权重传输到本地参数缓冲区。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:17
+msgid ""
+"If no seed is available, or any step fails, RFork cleans up and falls "
+"back to the default loader."
+msgstr "如果没有可用种子，或任何步骤失败，RFork 将进行清理并回退到默认加载器。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:18
+msgid ""
+"After the instance finishes loading, it starts a local seed service and "
+"periodically reports heartbeat to the planner, so later instances can "
+"reuse it."
+msgstr "实例完成加载后，会启动一个本地种子服务，并定期向规划器发送心跳，以便后续实例可以复用该实例。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:20
+msgid "Flowchart"
+msgstr "流程图"
+
+#: ../../source/user_guide/feature_guide/rfork.md:22
+msgid "![rfork flowchart](./images/rfork_flowchart.jpg)"
+msgstr "![rfork 流程图](./images/rfork_flowchart.jpg)"
+
+#: ../../source/user_guide/feature_guide/rfork.md:22
+msgid "rfork flowchart"
+msgstr "rfork 流程图"
+
+#: ../../source/user_guide/feature_guide/rfork.md:24
+msgid "Application Scenarios"
+msgstr "应用场景"
+
+#: ../../source/user_guide/feature_guide/rfork.md:26
+msgid ""
+"**Scale-out after a first successful load**: The first instance may still"
+" load from storage, but later instances with the same deployment identity"
+" can reuse it as a seed and shorten startup time."
+msgstr "**首次成功加载后的横向扩展**：第一个实例可能仍需从存储加载，但后续具有相同部署标识的实例可以将其作为种子复用，从而缩短启动时间。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:27
+msgid ""
+"**Elastic serving clusters**: Because RFork asks a planner for available "
+"seeds, it fits clusters where instances are created and reclaimed "
+"dynamically."
+msgstr "**弹性服务集群**：由于 RFork 会向规划器请求可用种子，因此它适用于实例动态创建和回收的集群。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:28
+msgid ""
+"**Topology-sensitive deployments**: RFork encodes `kv_role`, `node_rank`,"
+" `tp_rank`, and optional `draft` role into the seed key, so only "
+"topology-compatible instances are matched together."
+msgstr "**拓扑敏感的部署**：RFork 将 `kv_role`、`node_rank`、`tp_rank` 以及可选的 `draft` 角色编码到种子键中，因此只有拓扑兼容的实例才会被匹配在一起。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:32
+msgid "Usage"
+msgstr "使用方法"
+
+#: ../../source/user_guide/feature_guide/rfork.md:34
+msgid ""
+"To enable RFork, pass `--load-format rfork` and provide RFork settings "
+"through `--model-loader-extra-config` as a JSON string."
+msgstr "要启用 RFork，请传递 `--load-format rfork` 参数，并通过 `--model-loader-extra-config` 以 JSON 字符串的形式提供 RFork 设置。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:36
+msgid "RFork Prerequisites"
+msgstr "RFork 先决条件"
+
+#: ../../source/user_guide/feature_guide/rfork.md:38
+msgid ""
+"Install the runtime dependency `YuanRong TransferEngine` on every RFork "
+"instance."
+msgstr "在每个 RFork 实例上安装运行时依赖 `YuanRong TransferEngine`。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:39
+msgid ""
+"Run a planner service that implements the RFork seed protocol. A simple "
+"mock planner script is provided at "
+"[`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py)."
+msgstr "运行一个实现了 RFork 种子协议的规划器服务。在 [`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py) 提供了一个简单的模拟规划器脚本。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:41
+msgid "Configuration Fields"
+msgstr "配置字段"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Field Name"
+msgstr "字段名"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Type"
+msgstr "类型"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Description"
+msgstr "描述"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Allowed Values / Notes"
+msgstr "允许值 / 备注"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "**model_url**"
+msgstr "**model_url**"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "String"
+msgstr "字符串"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Logical model identifier used to build the RFork seed key."
+msgstr "用于构建 RFork 种子键的逻辑模型标识符。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid ""
+"Required for RFork transfer. Instances that should share seeds must use "
+"the same value."
+msgstr "RFork 传输所必需。应共享种子的实例必须使用相同的值。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "**model_deploy_strategy_name**"
+msgstr "**model_deploy_strategy_name**"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid ""
+"Deployment strategy identifier used together with `model_url` to build "
+"the seed key."
+msgstr "部署策略标识符，与 `model_url` 一起用于构建种子键。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "**rfork_scheduler_url**"
+msgstr "**rfork_scheduler_url**"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid ""
+"Base URL of the planner service used for seed allocation, release, and "
+"heartbeat."
+msgstr "用于种子分配、释放和心跳的规划器服务的基础 URL。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Required for planner-based matching. Example: `http://127.0.0.1:1223`."
+msgstr "基于规划器的匹配所必需。示例：`http://127.0.0.1:1223`。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "**rfork_seed_timeout_sec**"
+msgstr "**rfork_seed_timeout_sec**"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Number"
+msgstr "数字"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid ""
+"Timeout for waiting until the local seed HTTP service becomes healthy "
+"after startup."
+msgstr "启动后等待本地种子 HTTP 服务变为健康状态的超时时间。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Optional. Default: `30`. Must be greater than `0`."
+msgstr "可选。默认值：`30`。必须大于 `0`。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "**rfork_seed_key_separator**"
+msgstr "**rfork_seed_key_separator**"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Separator used when building the RFork seed key string."
+msgstr "构建 RFork 种子键字符串时使用的分隔符。"
+
+#: ../../source/user_guide/feature_guide/rfork.md
+msgid "Optional. Default: `$`. Keep the same value across compatible instances."
+msgstr "可选。默认值：`$`。在兼容的实例间保持相同的值。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:51
+msgid "How RFork Matches Seeds"
+msgstr "RFork 如何匹配种子"
+
+#: ../../source/user_guide/feature_guide/rfork.md:53
+msgid ""
+"RFork does not match instances by `model_url` alone. The local seed key "
+"is composed from:"
+msgstr "RFork 不仅通过 `model_url` 来匹配实例。本地种子键由以下部分组成："
+
+#: ../../source/user_guide/feature_guide/rfork.md:55
+msgid "`model_url`"
+msgstr "`model_url`"
+
+#: ../../source/user_guide/feature_guide/rfork.md:56
+msgid "`model_deploy_strategy_name`"
+msgstr "`model_deploy_strategy_name`"
+
+#: ../../source/user_guide/feature_guide/rfork.md:57
+msgid "disaggregation mode derived from `kv_transfer_config.kv_role` or `kv_both`"
+msgstr "从 `kv_transfer_config.kv_role` 或 `kv_both` 派生的解耦模式"
+
+#: ../../source/user_guide/feature_guide/rfork.md:58
+msgid "`node_rank`"
+msgstr "`node_rank`"
+
+#: ../../source/user_guide/feature_guide/rfork.md:59
+msgid "`tp_rank`"
+msgstr "`tp_rank`"
+
+#: ../../source/user_guide/feature_guide/rfork.md:60
+msgid "optional `draft` suffix when the worker runs as a draft model"
+msgstr "当工作器作为草稿模型运行时，可选的 `draft` 后缀"
+
+#: ../../source/user_guide/feature_guide/rfork.md:62
+msgid ""
+"This means two instances must agree on both model identity and deployment"
+" topology before the planner will treat them as interchangeable seeds."
+msgstr "这意味着两个实例必须在模型标识和部署拓扑上都达成一致，规划器才会将它们视为可互换的种子。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:66
+msgid "Example Commands & Placeholders"
+msgstr "示例命令与占位符"
+
+#: ../../source/user_guide/feature_guide/rfork.md:68
+msgid "Replace parts in `` `<...>` `` before running."
+msgstr "运行前替换 `` `<...>` `` 中的部分。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:70
+msgid "1. Install YuanRong TransferEngine"
+msgstr "1. 安装 YuanRong TransferEngine"
+
+#: ../../source/user_guide/feature_guide/rfork.md:76
+msgid "2. Start the Planner"
+msgstr "2. 启动规划器"
+
+#: ../../source/user_guide/feature_guide/rfork.md:78
+msgid ""
+"A simple planner implementation is provided at "
+"[`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py)."
+msgstr "在 [`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py) 提供了一个简单的规划器实现。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:86
+msgid "3. Start vLLM Instances"
+msgstr "3. 启动 vLLM 实例"
+
+#: ../../source/user_guide/feature_guide/rfork.md:88
+msgid ""
+"Use the same RFork startup command for both the first instance and later "
+"instances in the same deployment."
+msgstr "对于同一部署中的第一个实例和后续实例，使用相同的 RFork 启动命令。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:90
+msgid ""
+"For the first instance, the planner usually has no compatible seed yet, "
+"so RFork falls back to the default loader. After loading finishes, that "
+"instance starts its local seed service and reports itself to the planner."
+msgstr "对于第一个实例，规划器通常还没有兼容的种子，因此 RFork 会回退到默认加载器。加载完成后，该实例会启动其本地种子服务，并向规划器报告自身。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:92
+msgid ""
+"For later instances, if the planner can allocate a compatible seed, RFork"
+" will try to transfer weights from the existing seed instance before "
+"falling back to the default loader."
+msgstr "对于后续实例，如果规划器能分配一个兼容的种子，RFork 会先尝试从现有的种子实例传输权重，仅在传输失败时才回退到默认加载器。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:109
+msgid "Placeholder Descriptions"
+msgstr "占位符说明"
+
+#: ../../source/user_guide/feature_guide/rfork.md:111
+msgid "`<model_path>`: Model path or model identifier passed to `vllm serve`."
+msgstr "`<model_path>`：传递给 `vllm serve` 的模型路径或模型标识符。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:112
+msgid "`<served_model_name>`: Service name exposed by vLLM."
+msgstr "`<served_model_name>`：vLLM 暴露的服务名称。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:113
+msgid "`<planner_ip>`: IP address or hostname of the RFork planner."
+msgstr "`<planner_ip>`：RFork 规划器的 IP 地址或主机名。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:114
+msgid "`<planner_port>`: Listening port of the RFork planner."
+msgstr "`<planner_port>`：RFork 规划器的监听端口。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:115
+msgid ""
+"`<model_url>`: Stable model identity string used to build the RFork seed "
+"key."
+msgstr "`<model_url>`：用于构建 RFork 种子键的稳定模型标识字符串。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:116
+msgid ""
+"`<deploy_strategy>`: Stable deployment-strategy name used to build the "
+"RFork seed key."
+msgstr "`<deploy_strategy>`：用于构建 RFork 种子键的稳定部署策略名称。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:117
+msgid "`<port>`: Serving port of the vLLM instance being started."
+msgstr "`<port>`：正在启动的 vLLM 实例的服务端口。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:121
+msgid "Note & Caveats"
+msgstr "注意事项与限制"
+
+#: ../../source/user_guide/feature_guide/rfork.md:123
+msgid ""
+"RFork requires `YuanRong TransferEngine` at runtime. If the package is "
+"missing, RFork cannot initialize the transfer backend."
+msgstr "RFork 在运行时需要 `YuanRong TransferEngine`。如果缺少该软件包，RFork 将无法初始化传输后端。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:124
+msgid ""
+"If RFORK is used, **each worker process** must bind a listening port. "
+"That port is assigned randomly."
+msgstr ""
+"如果使用 RFork，**每个工作进程**都必须绑定一个监听端口。该端口是随机分配的。"
+
+#: ../../source/user_guide/feature_guide/rfork.md:125
+msgid ""
+"The example "
+"[`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py) is only"
+" a simple mock implementation. If you need stronger scheduling, capacity "
+"management, or production-grade availability behavior, implement your own"
+" planner based on the RFork seed protocol."
+msgstr ""
+"示例 [`rfork_planner.py`](../../../../examples/rfork/rfork_planner.py) 仅是一个简单的模拟实现。如果您需要更强大的调度、容量管理或生产级可用性行为，请基于 RFork 种子协议实现您自己的规划器。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sequence_parallelism.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sequence_parallelism.po
@@ -0,0 +1,435 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:1
+msgid "Sequence Parallelism"
+msgstr "序列并行"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:3
+msgid "What is Sequence Parallelism"
+msgstr "什么是序列并行"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:5
+msgid ""
+"Sequence Parallelism (SP) was first introduced in "
+"[Megatron](https://arxiv.org/pdf/2205.05198), with the original intention"
+" of reducing training activation memory. The core modification was "
+"changing `Allreduce->LayerNorm` to `ReduceScatter->LayerNorm->Allgather`."
+" This technique was later applied to inference by vllm. It should be "
+"noted that splitting Allreduce into ReduceScatter and Allgather does not "
+"inherently bring performance benefits; it reduces the computation load of"
+" LayerNorm, but this gain is minimal. The real benefits of SP come from:"
+msgstr ""
+"序列并行（Sequence Parallelism，SP）最初由 "
+"[Megatron](https://arxiv.org/pdf/2205.05198) 提出，其初衷是减少训练时的激活内存。核心改动是将 "
+"`Allreduce->LayerNorm` 改为 `ReduceScatter->LayerNorm->Allgather`。这项技术后来被 vllm "
+"应用于推理。需要注意的是，将 Allreduce 拆分为 ReduceScatter 和 Allgather 本身并不会带来性能收益；它减少了 LayerNorm "
+"的计算量，但这种收益微乎其微。SP 的真正收益来自："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:7
+msgid ""
+"LLM inference deployment often uses quantization. Taking INT8 "
+"quantization commonly used on NPUs as an example, after LayerNorm, a "
+"Quant operator quantizes the hidden states from BF16 to INT8. The "
+"communication volume of Allgather is halved, and the time consumption is "
+"almost halved."
+msgstr ""
+"LLM 推理部署常使用量化。以 NPU 上常用的 INT8 量化为例，在 LayerNorm 之后，Quant 算子会将隐藏状态从 BF16 量化为 INT8。此时 "
+"Allgather 的通信量减半，耗时也几乎减半。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:8
+msgid ""
+"ReduceScatter and Allgather can be fused with the preceding and following"
+" Matmul operations respectively into communication-computation parallel "
+"operators, reducing latency."
+msgstr "ReduceScatter 和 Allgather 可以分别与前后 Matmul 操作融合为通信-计算并行算子，从而降低延迟。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:10
+msgid "How to Use"
+msgstr "如何使用"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:12
+msgid ""
+"Currently, vllm-ascend has implemented Sequence Parallelism for VL-class "
+"models based on the Inductor pass. It can be enabled in the following "
+"way:"
+msgstr "目前，vllm-ascend 已基于 Inductor pass 为 VL 类模型实现了序列并行。可以通过以下方式启用："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:20
+msgid ""
+"`\"enable_sp\"`: This is the switch for SP. Since SP relies on graph "
+"mode, it is not supported in eager mode."
+msgstr "`\"enable_sp\"`：这是 SP 的开关。由于 SP 依赖于图模式，因此在 eager 模式下不受支持。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:21
+#, python-brace-format
+msgid ""
+"`sp_min_token_num` (from upstream vllm's `pass_config`): Based on our "
+"experiments, when the number of tokens is small (empirical value is less "
+"than 1000), SP can actually bring negative impact. This is because when "
+"the communication volume is small, the fixed overhead of the "
+"communication operator becomes the dominant factor. SP will only take "
+"effect when `num_tokens >= sp_min_token_num`. **The default value is 1000"
+" on Ascend, which generally does not need to be modified.** To customize,"
+" use `--compilation-config '{\"pass_config\": {\"enable_sp\": true, "
+"\"sp_min_token_num\": 512}}'`. The value will be appended into "
+"`compile_ranges_split_points`, which splits the graph compilation range "
+"and checks whether the pass is applicable per range."
+msgstr ""
+"`sp_min_token_num`（来自上游 vllm 的 `pass_config`）：根据我们的实验，当 token 数量较少（经验值小于 1000）时，SP "
+"实际上可能带来负面影响。这是因为当通信量较小时，通信算子的固定开销成为主导因素。SP 仅在 `num_tokens >= sp_min_token_num` "
+"时生效。**在 Ascend 上默认值为 1000，通常无需修改。** 如需自定义，请使用 `--compilation-config '{\"pass_config\": "
+"{\"enable_sp\": true, \"sp_min_token_num\": 512}}'`。该值将被追加到 `compile_ranges_split_points` "
+"中，用于分割图编译范围，并检查每个范围是否适用该 pass。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:23
+msgid ""
+"Without modifying `sp_min_token_num`, the simplest way and recommended "
+"way to enable SP is:"
+msgstr "在不修改 `sp_min_token_num` 的情况下，启用 SP 最简单且推荐的方式是："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:31
+msgid "Difference Between SP and Flash Comm V1"
+msgstr "SP 与 Flash Comm V1 的区别"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:33
+msgid ""
+"[Flash Comm V1 (FC1)](https://gitcode.com/ascend-tribe/ascend-inference-"
+"cluster/blob/main/FlashComm/ascend-inference-cluster-flashcomm.md) is an "
+"enhanced version of Sequence Parallelism developed based on NPU. The "
+"enhancements include:"
+msgstr ""
+"[Flash Comm V1 (FC1)](https://gitcode.com/ascend-tribe/ascend-inference-"
+"cluster/blob/main/FlashComm/ascend-inference-cluster-flashcomm.md) 是基于 NPU "
+"开发的序列并行增强版本。其增强包括："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:35
+msgid ""
+"For models using the MLA structure, Allgather is postponed until after "
+"QKV projection, further reducing communication volume."
+msgstr "对于使用 MLA 结构的模型，Allgather 被推迟到 QKV 投影之后，进一步减少了通信量。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:36
+msgid ""
+"For MoE models, Allgather is postponed until after Gating+DynamicQuant, "
+"also aiming to reduce communication volume."
+msgstr "对于 MoE 模型，Allgather 被推迟到 Gating+DynamicQuant 之后，同样旨在减少通信量。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:38
+msgid ""
+"FC1 is a unique optimization in vllm-ascend, currently implemented based "
+"on Custom OP, but it is difficult to support VL-class models (reasons "
+"detailed in [[RFC]: support sequence parallelism by "
+"pass](https://github.com/vllm-project/vllm-ascend/issues/5712) ). "
+"Therefore, currently FC1 and SP are complementary."
+msgstr ""
+"FC1 是 vllm-ascend 中独特的优化，目前基于 Custom OP 实现，但难以支持 VL 类模型（原因详见 [[RFC]: support "
+"sequence parallelism by "
+"pass](https://github.com/vllm-project/vllm-ascend/issues/5712)）。因此，目前 FC1 和 SP "
+"是互补的。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:40
+msgid "Support Matrix"
+msgstr "支持矩阵"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:42
+msgid "Without Quantization"
+msgstr "无量化"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "VL + Dense"
+msgstr "VL + 稠密"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "VL + MoE"
+msgstr "VL + MoE"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "non-VL + Dense"
+msgstr "非 VL + 稠密"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "non-VL + MoE"
+msgstr "非 VL + MoE"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "graph"
+msgstr "图模式"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "x"
+msgstr "x"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Flash Comm V1"
+msgstr "Flash Comm V1"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "eager/graph"
+msgstr "eager/图模式"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:49
+msgid "With Quantization"
+msgstr "带量化"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:51
+msgid "SP currently does not support quantization and is under adaptation."
+msgstr "SP 目前不支持量化，正在适配中。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:58
+msgid "Pass Design"
+msgstr "Pass 设计"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:60
+msgid ""
+"When SP is enabled, the following passes run in order: "
+"`SequenceParallelismPass` then `SequenceParallelismMoePass`."
+msgstr "启用 SP 时，以下 pass 按顺序运行：先 `SequenceParallelismPass`，然后 `SequenceParallelismMoePass`。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:62
+msgid "SequenceParallelismPass"
+msgstr "SequenceParallelismPass"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:64
+msgid ""
+"Runs `NoOpEliminationPass` first to eliminate redundant view-like "
+"operations, then applies AllReduce-based patterns:"
+msgstr "首先运行 `NoOpEliminationPass` 以消除冗余的类视图操作，然后应用基于 AllReduce 的模式："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Pattern"
+msgstr "模式"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Match"
+msgstr "匹配"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Replacement"
+msgstr "替换"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`MiddleAllReduceRMSNormPattern`"
+msgstr "`MiddleAllReduceRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`all_reduce` + `layernorm`"
+msgstr "`all_reduce` + `layernorm`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`reduce_scatter` + `layernorm` + `all_gather`"
+msgstr "`reduce_scatter` + `layernorm` + `all_gather`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`LastAllReduceRMSNormPattern`"
+msgstr "`LastAllReduceRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Same (last layer, no residual)"
+msgstr "相同（最后一层，无残差）"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "Same"
+msgstr "相同"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`Qwen3VLMiddleAllReduceRMSNormPattern`"
+msgstr "`Qwen3VLMiddleAllReduceRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`all_reduce` + add + `layernorm`"
+msgstr "`all_reduce` + add + `layernorm`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid ""
+"`reduce_scatter` + chunk(`deepstack_input_embeds`) + add + `layernorm` + "
+"`all_gather`"
+msgstr "`reduce_scatter` + chunk(`deepstack_input_embeds`) + add + `layernorm` + `all_gather`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:72
+msgid ""
+"**Why Qwen3 VL needs special handling by "
+"Qwen3VLMiddleAllReduceRMSNormPattern**"
+msgstr "**为什么 Qwen3 VL 需要 Qwen3VLMiddleAllReduceRMSNormPattern 特殊处理**"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:74
+msgid ""
+"Qwen3-VL middle layers insert an extra add between `all_reduce` and "
+"`layernorm`: `hidden_states=hidden_states + deepstack_input_embeds`. "
+"Under SP, `hidden_states` (i.e., `input`) is reduced-scattered to shape "
+"`[seq_len/tp, hidden]` per rank, while `deepstack_input_embeds` comes "
+"from the vision/deepstack path and stays full-sequence `[seq_len, "
+"hidden]` (typically replicated across TP ranks). Simply doing "
+"`reduce_scatter(input) + deepstack_input_embeds` would cause a shape "
+"mismatch. The fix is to chunk `deepstack_input_embeds` by `tp_size` so "
+"each rank uses `add(reduce_scatter, "
+"chunk(deepstack_input_embeds)[tp_rank])`, keeping shapes consistent "
+"before `layernorm` and `all_gather`."
+msgstr ""
+"Qwen3-VL 的中间层在 `all_reduce` 和 `layernorm` 之间插入了一个额外的 add 操作：`hidden_states=hidden_states "
+"+ deepstack_input_embeds`。在 SP 下，`hidden_states`（即 `input`）被 reduce-scatter "
+"到每个 rank 的形状 `[seq_len/tp, hidden]`，而 `deepstack_input_embeds` 来自视觉/deepstack "
+"路径，并保持全序列形状 `[seq_len, hidden]`（通常在 TP rank 间复制）。简单地执行 `reduce_scatter(input) + "
+"deepstack_input_embeds` 会导致形状不匹配。解决方法是按 `tp_size` 对 `deepstack_input_embeds` 进行 "
+"chunk，使得每个 rank 使用 `add(reduce_scatter, chunk(deepstack_input_embeds)[tp_rank])`，从而在 "
+"`layernorm` 和 `all_gather` 之前保持形状一致。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:77
+msgid "SequenceParallelismMoePass"
+msgstr "SequenceParallelismMoePass"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:79
+msgid ""
+"After `SequenceParallelismPass` applies, the MoE model computation graph "
+"looks like:"
+msgstr "应用 `SequenceParallelismPass` 后，MoE 模型的计算图如下所示："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:81
+msgid "![AllGather EP computation graph](../../assets/sp_moe.png)"
+msgstr "![AllGather EP 计算图](../../assets/sp_moe.png)"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:81
+msgid "AllGather EP computation graph"
+msgstr "AllGather EP 计算图"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:83
+msgid "**Overview**"
+msgstr "**概述**"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:85
+msgid ""
+"**Postponing allgather**: Under SP, `residual` is chunked by tensor "
+"parallelism. This causes a shape mismatch between hidden states and "
+"residual in the next layer's layernorm: hidden states are gathered (full "
+"sequence) while residual remains chunked. The fix is to move `all_gather`"
+" to *after* layernorm so that layernorm operates on consistent shapes per"
+" rank. `MiddleLayerAllgatherAddRMSNormPattern`, "
+"`LastLayerAllgatherRMSNormPattern`, and "
+"`Qwen3VLMiddleLayerAllgatherAddRMSNormPattern` are designed for this "
+"purpose, each handling different layer and structure variants (see the "
+"table below)."
+msgstr ""
+"**推迟 allgather**：在 SP 下，`residual` 被张量并行切分。这导致下一层 layernorm 中隐藏状态和残差的形状不匹配：隐藏状态被聚集（全序列），而残差保持切分状态。解决方法是将 "
+"`all_gather` 移动到 layernorm *之后*，使得 layernorm 在每个 rank 上操作一致的形状。`MiddleLayerAllgatherAddRMSNormPattern`、`LastLayerAllgatherRMSNormPattern` "
+"和 `Qwen3VLMiddleLayerAllgatherAddRMSNormPattern` 就是为此设计的，每个处理不同的层和结构变体（见下表）。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:87
+msgid ""
+"**AllGatherChunkNoOp cleanup**: When MoE SP is enabled, vllm introduces a"
+" `sequence_parallel_chunk` op (corresponding to `sp_chunk` in the "
+"diagram). Together with the preceding `all_gather`, the pair forms a "
+"redundant no-op (all_gather gathers, then chunk re-splits). "
+"`AllGatherChunkNoOpPattern` replaces this pair with identity to eliminate"
+" the redundant communication and computation."
+msgstr ""
+"**AllGatherChunkNoOp 清理**：当启用 MoE SP 时，vllm 引入了一个 `sequence_parallel_chunk` 算子（对应图中的 "
+"`sp_chunk`）。它与前面的 `all_gather` 一起形成了一个冗余的无操作（all_gather 聚集，然后 chunk 重新分割）。`AllGatherChunkNoOpPattern` "
+"将这对操作替换为恒等操作，以消除冗余的通信和计算。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:89
+msgid "**Pattern details:**"
+msgstr "**模式详情：**"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`MiddleLayerAllgatherAddRMSNormPattern`"
+msgstr "`MiddleLayerAllgatherAddRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`all_gather` + slice + `layernorm`"
+msgstr "`all_gather` + slice + `layernorm`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`layernorm` + `all_gather`"
+msgstr "`layernorm` + `all_gather`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`LastLayerAllgatherRMSNormPattern`"
+msgstr "`LastLayerAllgatherRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`Qwen3VLMiddleLayerAllgatherAddRMSNormPattern`"
+msgstr "`Qwen3VLMiddleLayerAllgatherAddRMSNormPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`all_gather` + slice + add + `layernorm`"
+msgstr "`all_gather` + slice + add + `layernorm`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "add(chunk) + `layernorm` + `all_gather`"
+msgstr "add(chunk) + `layernorm` + `all_gather`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`AllGatherChunkNoOpPattern`"
+msgstr "`AllGatherChunkNoOpPattern`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "`all_gather` + `sequence_parallel_chunk_impl`"
+msgstr "`all_gather` + `sequence_parallel_chunk_impl`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md
+msgid "identity (no-op)"
+msgstr "恒等操作（无操作）"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:98
+msgid "FAQ"
+msgstr "常见问题"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:100
+msgid "Q1: Is SP enabled by default?"
+msgstr "Q1: SP 是否默认启用？"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:102
+msgid ""
+"No, SP is not enabled by default. SP is currently in the experimental "
+"stage and will be enabled by default in the future."
+msgstr "不，SP 默认未启用。SP 目前处于实验阶段，未来将默认启用。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:104
+msgid "The processing flow of `enable_sp` in the code is:"
+msgstr "代码中 `enable_sp` 的处理流程如下："
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:106
+msgid "In `pass_config`, `enable_sp` and `sp_min_token_num` default to `None`"
+msgstr "在 `pass_config` 中，`enable_sp` 和 `sp_min_token_num` 默认为 `None`"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:107
+msgid ""
+"`NPUPlatform.apply_config_platform_defaults`: If `enable_sp` is `True` "
+"and `sp_min_token_num` is None, set default `sp_min_token_num` (1000 for "
+"Dense models, 1 for MoE models)"
+msgstr ""
+"`NPUPlatform.apply_config_platform_defaults`：如果 `enable_sp` 为 `True` 且 "
+"`sp_min_token_num` 为 None，则设置默认的 `sp_min_token_num`（Dense 模型为 1000，MoE 模型为 1）"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:108
+msgid ""
+"`VllmConfig._apply_optimization_level_defaults`: `enable_sp` is set to "
+"`True` for dense models."
+msgstr ""
+"`VllmConfig._apply_optimization_level_defaults`：对于 Dense 模型，`enable_sp` 被设置为 `True`。"
+
+#: ../../source/user_guide/feature_guide/sequence_parallelism.md:109
+msgid ""
+"`VllmConfig.__post_init__`: If `sp_min_token_num` is still `None`, then "
+"`enable_sp` is set to `False`"
+msgstr ""
+"`VllmConfig.__post_init__`：如果 `sp_min_token_num` 仍为 `None`，则 `enable_sp` 被设置为 `False`"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po
@@ -4,153 +4,139 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/feature_guide/sleep_mode.md:1
+#: ../../source/user_guide/feature_guide/sleep_mode.md:1
 msgid "Sleep Mode Guide"
 msgstr "睡眠模式指南"

-#: ../../user_guide/feature_guide/sleep_mode.md:3
+#: ../../source/user_guide/feature_guide/sleep_mode.md:3
 msgid "Overview"
 msgstr "概述"

-#: ../../user_guide/feature_guide/sleep_mode.md:5
+#: ../../source/user_guide/feature_guide/sleep_mode.md:5
 msgid ""
-"Sleep Mode is an API designed to offload model weights and discard KV cache "
-"from NPU memory. This functionality is essential for reinforcement learning "
-"(RL) post-training workloads, particularly in online algorithms such as PPO,"
-" GRPO, or DPO. During training, the policy model typically performs auto-"
-"regressive generation using inference engines like vLLM, followed by forward"
-" and backward passes for optimization."
+"Sleep Mode is an API designed to offload model weights and discard KV "
+"cache from NPU memory. This functionality is essential for reinforcement "
+"learning (RL) post-training workloads, particularly in online algorithms "
+"such as PPO, GRPO, or DPO. During training, the policy model typically "
+"performs autoregressive generation using inference engines like vLLM, "
+"followed by forward and backward passes for optimization."
 msgstr ""
-"Sleep Mode 是一个用于卸载模型权重并清除 NPU 内存中 KV 缓存的 API。此功能对于强化学习（RL）后训练任务尤其重要，特别是在 "
-"PPO、GRPO 或 DPO 等在线算法中。在训练过程中，策略模型通常会使用像 vLLM "
-"这样的推理引擎进行自回归生成，然后进行前向和反向传播以进行优化。"
+"睡眠模式是一个专为从NPU内存中卸载模型权重并丢弃KV缓存而设计的API。此功能对于强化学习（RL）后训练工作负载至关重要，特别是在PPO、GRPO或DPO等在线算法中。在训练期间，策略模型通常使用vLLM等推理引擎执行自回归生成，随后进行前向和反向传播以完成优化。"

-#: ../../user_guide/feature_guide/sleep_mode.md:7
+#: ../../source/user_guide/feature_guide/sleep_mode.md:7
 msgid ""
 "Since the generation and training phases may employ different model "
-"parallelism strategies, it becomes crucial to free KV cache and even offload"
-" model parameters stored within vLLM during training. This ensures efficient"
-" memory utilization and avoids resource contention on the NPU."
+"parallelism strategies, it becomes crucial to free KV cache and even "
+"offload model parameters stored within vLLM during training. This ensures"
+" efficient memory utilization and avoids resource contention on the NPU."
 msgstr ""
-"由于生成和训练阶段可能采用不同的模型并行策略，因此在训练过程中及时释放 KV 缓存，甚至卸载存储在 vLLM "
-"内的模型参数变得至关重要。这可以确保内存的高效利用，并避免 NPU 上的资源争用。"
+"由于生成阶段和训练阶段可能采用不同的模型并行策略，因此在训练期间释放KV缓存，甚至卸载存储在vLLM中的模型参数变得至关重要。这确保了高效的内存利用，并避免了NPU上的资源争用。"

-#: ../../user_guide/feature_guide/sleep_mode.md:10
+#: ../../source/user_guide/feature_guide/sleep_mode.md:9
 msgid "Getting started"
-msgstr "快速上手"
+msgstr "快速入门"

-#: ../../user_guide/feature_guide/sleep_mode.md:12
+#: ../../source/user_guide/feature_guide/sleep_mode.md:11
 #, python-brace-format
 msgid ""
-"With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in "
-"vllm will under a specific memory pool, during loading model and initialize "
-"kv_caches, we tag the memory as a map: `{\"weight\": data, \"kv_cache\": "
-"data}`."
+"With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in"
+" vllm is under a specific memory pool. During model loading and KV cache "
+"initialization, we tag the memory as a map: `{\"weight\": data, "
+"\"kv_cache\": data}`."
 msgstr ""
-"当 `enable_sleep_mode=True` 时，我们在 vllm 中管理内存（malloc, "
-"free）的方式会在一个特定的内存池下进行，在加载模型和初始化 kv_caches "
-"期间，我们会将内存打上标签，组织成一个映射：`{\"weight\": data, \"kv_cache\": data}`。"
+"当设置 `enable_sleep_mode=True` 时，我们在vllm中管理内存（分配、释放）的方式将在一个特定的内存池下进行。在模型加载和KV缓存初始化期间，我们将内存标记为一个映射：`{\"weight\": data, \"kv_cache\": data}`。"

-#: ../../user_guide/feature_guide/sleep_mode.md:14
+#: ../../source/user_guide/feature_guide/sleep_mode.md:13
 msgid ""
-"The engine(v0/v1) supports two sleep levels to manage memory during idle "
-"periods:"
-msgstr "该引擎（v0/v1）支持两种睡眠等级，以在空闲期间管理内存："
+"The engine (v0/v1) supports two sleep levels to manage memory during idle"
+" periods:"
+msgstr "引擎（v0/v1）支持两种睡眠等级，用于在空闲期间管理内存："

-#: ../../user_guide/feature_guide/sleep_mode.md:16
+#: ../../source/user_guide/feature_guide/sleep_mode.md:15
 msgid "Level 1 Sleep"
 msgstr "一级睡眠"

-#: ../../user_guide/feature_guide/sleep_mode.md:17
+#: ../../source/user_guide/feature_guide/sleep_mode.md:16
 msgid "Action: Offloads model weights and discards the KV cache."
-msgstr "操作：卸载模型权重并清除KV缓存。"
+msgstr "操作：卸载模型权重并丢弃KV缓存。"

-#: ../../user_guide/feature_guide/sleep_mode.md:18
+#: ../../source/user_guide/feature_guide/sleep_mode.md:17
 msgid "Memory: Model weights are moved to CPU memory; KV cache is forgotten."
-msgstr "内存：模型权重被移动到CPU内存；KV缓存被清除。"
+msgstr "内存：模型权重被移至CPU内存；KV缓存被清除。"

-#: ../../user_guide/feature_guide/sleep_mode.md:19
+#: ../../source/user_guide/feature_guide/sleep_mode.md:18
 msgid "Use Case: Suitable when reusing the same model later."
-msgstr "用例：适用于之后需要重复使用同一个模型的情况。"
+msgstr "用例：适用于后续需要复用同一模型的情况。"

-#: ../../user_guide/feature_guide/sleep_mode.md:20
-msgid ""
-"Note: Ensure sufficient CPU memory is available to hold the model weights."
-msgstr "注意：请确保有足够的CPU内存来存储模型权重。"
+#: ../../source/user_guide/feature_guide/sleep_mode.md:19
+msgid "Note: Ensure sufficient CPU memory is available to hold the model weights."
+msgstr "注意：确保有足够的CPU内存来容纳模型权重。"

-#: ../../user_guide/feature_guide/sleep_mode.md:22
+#: ../../source/user_guide/feature_guide/sleep_mode.md:21
 msgid "Level 2 Sleep"
 msgstr "二级睡眠"

-#: ../../user_guide/feature_guide/sleep_mode.md:23
+#: ../../source/user_guide/feature_guide/sleep_mode.md:22
 msgid "Action: Discards both model weights and KV cache."
 msgstr "操作：同时丢弃模型权重和KV缓存。"

-#: ../../user_guide/feature_guide/sleep_mode.md:24
-msgid ""
-"Memory: The content of both the model weights and kv cache is forgotten."
-msgstr "内存：模型权重和kv缓存的内容都会被遗忘。"
+#: ../../source/user_guide/feature_guide/sleep_mode.md:23
+msgid "Memory: The content of both the model weights and KV cache is forgotten."
+msgstr "内存：模型权重和KV缓存的内容均被清除。"

-#: ../../user_guide/feature_guide/sleep_mode.md:25
+#: ../../source/user_guide/feature_guide/sleep_mode.md:24
 msgid ""
-"Use Case: Ideal when switching to a different model or updating the current "
-"one."
-msgstr "用例：当切换到不同的模型或更新当前模型时非常理想。"
+"Use Case: Ideal when switching to a different model or updating the "
+"current one."
+msgstr "用例：当需要切换到不同模型或更新当前模型时，此模式非常理想。"

-#: ../../user_guide/feature_guide/sleep_mode.md:27
+#: ../../source/user_guide/feature_guide/sleep_mode.md:26
 msgid ""
 "Since this feature uses the low-level API "
 "[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html),"
 " in order to use sleep mode, you should follow the [installation "
-"guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and "
-"building from source, if you are using v0.7.3, remember to set `export "
-"COMPILE_CUSTOM_KERNELS=1`, for the latest version(v0.9.x+), the environment "
-"variable `COMPILE_CUSTOM_KERNELS` will be set 1 by default while building "
-"from source."
+"guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html) "
+"and build from source. If you are using < v0.12.0rc1, remember to set "
+"`export COMPILE_CUSTOM_KERNELS=1`."
 msgstr ""
-"由于此功能使用了底层 API "
-"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html)，为了使用休眠模式，你应按照[安装指南](https://vllm-"
-"ascend.readthedocs.io/en/latest/installation.html)进行操作，并从源码编译。如果你使用的是 "
-"v0.7.3，请记得设置 `export COMPILE_CUSTOM_KERNELS=1` ；对于最新版本（v0.9.x+），在从源码编译时环境变量 "
-"`COMPILE_CUSTOM_KERNELS` 默认会被设置为 1。"
+"由于此功能使用了底层API "
+"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html)，为了使用睡眠模式，您应遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)并从源码构建。如果您使用的版本低于v0.12.0rc1，请记得设置 `export COMPILE_CUSTOM_KERNELS=1`。"

-#: ../../user_guide/feature_guide/sleep_mode.md:29
+#: ../../source/user_guide/feature_guide/sleep_mode.md:28
 msgid "Usage"
 msgstr "用法"

-#: ../../user_guide/feature_guide/sleep_mode.md:31
+#: ../../source/user_guide/feature_guide/sleep_mode.md:30
 msgid "The following is a simple example of how to use sleep mode."
-msgstr "以下是如何使用睡眠模式的一个简单示例。"
+msgstr "以下是一个如何使用睡眠模式的简单示例。"

-#: ../../user_guide/feature_guide/sleep_mode.md:33
-msgid "offline inference:"
+#: ../../source/user_guide/feature_guide/sleep_mode.md:32
+msgid "Offline inference:"
 msgstr "离线推理："

-#: ../../user_guide/feature_guide/sleep_mode.md:72
-msgid "online serving:"
+#: ../../source/user_guide/feature_guide/sleep_mode.md:72
+msgid "Online serving:"
 msgstr "在线服务："

-#: ../../user_guide/feature_guide/sleep_mode.md:74
+#: ../../source/user_guide/feature_guide/sleep_mode.md:74
 msgid ""
-"Considering there may be a risk of malicious access, please make sure you "
-"are under a dev-mode, and explicit specify the develop env: "
-"`VLLM_SERVER_DEV_MODE` to expose these endpoints(sleep/wake up)."
+"Considering there may be a risk of malicious access, please make sure you"
+" are under a dev-mode, and explicitly specify the dev environment "
+"`VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up)."
 msgstr ""
-"鉴于可能存在恶意访问的风险，请确保您处于开发模式，并明确指定开发环境：`VLLM_SERVER_DEV_MODE`，以便开放这些端点（sleep/wake"
-" up）。"
+"考虑到可能存在恶意访问的风险，请确保您处于开发模式，并明确指定开发环境变量 `VLLM_SERVER_DEV_MODE` 以开放这些端点（sleep/wake up）。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/speculative_decoding.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/speculative_decoding.po
@@ -0,0 +1,164 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:1
+msgid "Speculative Decoding Guide"
+msgstr "推测解码指南"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:3
+msgid ""
+"This guide shows how to use Speculative Decoding with vLLM Ascend. "
+"Speculative decoding is a technique which improves inter-token latency in"
+" memory-bound LLM inference."
+msgstr "本指南展示了如何在 vLLM Ascend 中使用推测解码。推测解码是一种技术，用于改善内存受限的 LLM 推理中的令牌间延迟。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:5
+msgid "Speculating by matching n-grams in the prompt"
+msgstr "通过匹配提示中的 n-gram 进行推测"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:7
+msgid ""
+"The following code configures vLLM Ascend to use speculative decoding "
+"where proposals are generated by matching n-grams in the prompt."
+msgstr "以下代码配置 vLLM Ascend 使用推测解码，其中候选令牌通过匹配提示中的 n-gram 生成。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:9
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:42
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:127
+msgid "Offline inference"
+msgstr "离线推理"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:36
+msgid "Speculating using EAGLE based draft models"
+msgstr "使用基于 EAGLE 的草稿模型进行推测"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:38
+msgid ""
+"The following code configures vLLM Ascend to use speculative decoding "
+"where proposals are generated by an [EAGLE (Extrapolation Algorithm for "
+"Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) "
+"based draft model."
+msgstr "以下代码配置 vLLM Ascend 使用推测解码，其中候选令牌由基于 [EAGLE（用于提升语言模型效率的外推算法）](https://arxiv.org/pdf/2401.15077) 的草稿模型生成。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:40
+msgid ""
+"In v0.12.0rc1 of vLLM Ascend, the async scheduler is more stable and "
+"ready to be enabled. We have adapted it to support EAGLE, and you can use"
+" it by setting `async_scheduling=True` as follows. If you encounter any "
+"issues, please feel free to open an issue on GitHub. As a workaround, you"
+" can disable this feature by unsetting `async_scheduling=True` when "
+"initializing the model."
+msgstr "在 vLLM Ascend 的 v0.12.0rc1 版本中，异步调度器更加稳定并已准备就绪。我们已使其适配以支持 EAGLE，您可以通过如下设置 `async_scheduling=True` 来使用它。如果您遇到任何问题，请随时在 GitHub 上提交 issue。作为一种变通方案，您可以在初始化模型时不设置 `async_scheduling=True` 来禁用此功能。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:74
+msgid ""
+"A few important things to consider when using the EAGLE based draft "
+"models:"
+msgstr "使用基于 EAGLE 的草稿模型时，需要考虑以下几点重要事项："
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:76
+msgid ""
+"The EAGLE draft models available in the [HF repository for EAGLE "
+"models](https://huggingface.co/yuhuili) should be loaded and used "
+"directly by vLLM. This functionality was added in PR "
+"[#4893](https://github.com/vllm-project/vllm-ascend/pull/4893). If you "
+"are using a vLLM version released before this pull request was merged, "
+"please update to a more recent version."
+msgstr "[EAGLE 模型的 HF 仓库](https://huggingface.co/yuhuili) 中可用的 EAGLE 草稿模型应由 vLLM 直接加载和使用。此功能在 PR [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893) 中添加。如果您使用的 vLLM 版本是在此拉取请求合并之前发布的，请更新到较新的版本。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:80
+msgid ""
+"The EAGLE based draft models need to be run without tensor parallelism "
+"(i.e. draft_tensor_parallel_size is set to 1 in `speculative_config`), "
+"although it is possible to run the main model using tensor parallelism "
+"(see example above)."
+msgstr "基于 EAGLE 的草稿模型需要在没有张量并行的情况下运行（即在 `speculative_config` 中 `draft_tensor_parallel_size` 设置为 1），尽管主模型可以使用张量并行运行（参见上面的示例）。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:84
+msgid ""
+"When using EAGLE-3 based draft model, option \"method\" must be set to "
+"\"eagle3\". That is, to specify `\"method\": \"eagle3\"` in "
+"`speculative_config`."
+msgstr "当使用基于 EAGLE-3 的草稿模型时，选项 \"method\" 必须设置为 \"eagle3\"。也就是说，在 `speculative_config` 中指定 `\"method\": \"eagle3\"`。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:87
+msgid ""
+"After enabling EAGLE, the main model needs to verify `(1 + K)` tokens "
+"generated by the main model and the draft model in one decoding process. "
+"And the fullgraph mode will fix the number of tokens during the "
+"verification stage, so `cudagraph_capture_sizes` must be a list of "
+"capture sizes, where each size is calculated as `n * (K + 1)` for each "
+"batch size `n` you want to support. For instance, to support batch sizes "
+"from 1 to 4 with `num_speculative_tokens = 4`, `cudagraph_capture_sizes` "
+"should be set to `[5, 10, 15, 20]`."
+msgstr "启用 EAGLE 后，主模型需要在一次解码过程中验证由主模型和草稿模型生成的 `(1 + K)` 个令牌。由于 fullgraph 模式会在验证阶段固定令牌数量，`cudagraph_capture_sizes` 必须是一个捕获大小列表，其中每个大小按 `n * (K + 1)` 计算，`n` 为您希望支持的各个批次大小。例如，要在 `num_speculative_tokens = 4` 时支持 1 到 4 的批次大小，`cudagraph_capture_sizes` 应设置为 `[5, 10, 15, 20]`。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:92
+msgid "Speculating using MTP speculators"
+msgstr "使用 MTP 推测器进行推测"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:94
+msgid ""
+"The following code configures vLLM Ascend to use speculative decoding "
+"where proposals are generated by MTP (Multi Token Prediction), boosting "
+"inference performance by parallelizing the prediction of multiple tokens."
+" For more information about MTP see "
+"[Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html)"
+msgstr "以下代码配置 vLLM Ascend 使用推测解码，其中候选令牌由 MTP（多令牌预测）生成，通过并行预测多个令牌来提升推理性能。有关 MTP 的更多信息，请参阅 [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html)"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:96
+msgid "Online inference"
+msgstr "在线推理"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:116
+msgid "Speculating using Suffix Decoding"
+msgstr "使用后缀解码进行推测"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:118
+msgid ""
+"The following code configures vLLM to use speculative decoding where "
+"proposals are generated using Suffix Decoding [(SuffixDecoding: Extreme "
+"Speculative Decoding for Emerging AI "
+"Applications)](https://arxiv.org/abs/2411.04975)."
+msgstr "以下代码配置 vLLM 使用推测解码，其中候选令牌使用后缀解码生成 [(SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications)](https://arxiv.org/abs/2411.04975)。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:120
+msgid ""
+"Like n-gram, Suffix Decoding can generate draft tokens by pattern-"
+"matching using the last `n` generated tokens. Unlike n-gram, Suffix "
+"Decoding (1) can pattern-match against both the prompt and previous "
+"generations, (2) uses frequency counts to propose the most likely "
+"continuations, and (3) speculates an adaptive number of tokens for each "
+"request at each iteration to get better acceptance rates."
+msgstr "与 n-gram 类似，后缀解码可以通过使用最后 `n` 个生成的令牌进行模式匹配来生成草稿令牌。与 n-gram 不同，后缀解码 (1) 可以针对提示和先前生成的内容进行模式匹配，(2) 使用频率计数来提出最可能的延续序列，(3) 在每次迭代中为每个请求推测自适应数量的令牌，以获得更好的接受率。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:122
+msgid ""
+"Suffix Decoding can achieve better performance for tasks with high "
+"repetition, such as code-editing, agentic loops (e.g. self-reflection, "
+"self-consistency), and RL rollouts."
+msgstr "后缀解码可以在具有高重复性的任务上实现更好的性能，例如代码编辑、智能体循环（例如自我反思、自我一致性）和 RL 推演。"
+
+#: ../../source/user_guide/feature_guide/speculative_decoding.md:124
+msgid ""
+"[!NOTE] Suffix Decoding requires Arctic Inference. You can install it "
+"with `pip install arctic-inference`."
+msgstr "[!NOTE] 后缀解码需要 Arctic Inference。您可以使用 `pip install arctic-inference` 安装它。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/structured_output.po
@@ -4,217 +4,73 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/feature_guide/structured_output.md:1
+#: ../../source/user_guide/feature_guide/structured_output.md:1
 msgid "Structured Output Guide"
 msgstr "结构化输出指南"

-#: ../../user_guide/feature_guide/structured_output.md:3
+#: ../../source/user_guide/feature_guide/structured_output.md:3
 msgid "Overview"
 msgstr "概述"

-#: ../../user_guide/feature_guide/structured_output.md:5
-msgid "What is Structured Output?"
+#: ../../source/user_guide/feature_guide/structured_output.md:5
+msgid "What is structured output?"
 msgstr "什么是结构化输出？"

-#: ../../user_guide/feature_guide/structured_output.md:7
+#: ../../source/user_guide/feature_guide/structured_output.md:7
 msgid ""
-"LLMs can be unpredictable when you need output in specific formats. Think of"
-" asking a model to generate JSON - without guidance, it might produce valid "
-"text that breaks JSON specification. **Structured Output (also called Guided"
-" Decoding)** enables LLMs to generate outputs that follow a desired "
-"structure while preserving the non-deterministic nature of the system."
+"LLMs can be unpredictable when you need output in specific formats. Think"
+" of asking a model to generate JSON without guidance, it might produce "
+"valid text that breaks JSON specification. **Structured Output (also "
+"known as Guided Decoding)** enables LLMs to generate outputs that follow "
+"a desired structure while preserving the non-deterministic nature of the "
+"system."
 msgstr ""
-"当你需要特定格式输出时，大型语言模型（LLMs）可能表现出不可预测性。比如让模型生成 "
-"JSON，如果没有指导，模型可能会生成有效的文本，但这些文本却不符合 JSON 规范。**结构化输出（也称为引导解码）** "
-"能让大型语言模型生成符合预期结构的输出，同时保留系统的非确定性特性。"
+"当您需要特定格式的输出时，大型语言模型（LLMs）的行为可能难以预测。试想一下，在没有指导的情况下要求模型生成"
+" JSON，它可能会生成有效的文本，但却破坏了 JSON 规范。**结构化输出（也称为引导解码）** "
+"使大型语言模型能够生成符合预期结构的输出，同时保留系统的非确定性特性。"

-#: ../../user_guide/feature_guide/structured_output.md:9
+#: ../../source/user_guide/feature_guide/structured_output.md:9
 msgid ""
-"In simple terms, structured decoding gives LLMs a “template” to follow. "
-"Users provide a schema that “influences” the model’s output, ensuring "
+"In simple terms, structured decoding gives LLMs a \"template\" to follow."
+" Users provide a schema that \"influences\" the model output, ensuring "
 "compliance with the desired structure."
-msgstr "简单来说，结构化解码为LLM提供了一个“模板”来遵循。用户提供一个模式来“影响”模型的输出，从而确保输出符合期望的结构。"
+msgstr "简而言之，结构化解码为大型语言模型提供了一个需要遵循的“模板”。用户提供一个“影响”模型输出的模式，以确保输出符合期望的结构。"

-#: ../../user_guide/feature_guide/structured_output.md:11
+#: ../../source/user_guide/feature_guide/structured_output.md:11
 msgid "![structured decoding](./images/structured_output_1.png)"
 msgstr "![结构化解码](./images/structured_output_1.png)"

-#: ../../user_guide/feature_guide/structured_output.md:11
+#: ../../source/user_guide/feature_guide/structured_output.md:11
 msgid "structured decoding"
 msgstr "结构化解码"

-#: ../../user_guide/feature_guide/structured_output.md:13
-msgid "Structured Output in vllm-ascend"
-msgstr "vllm-ascend 中的结构化输出"
+#: ../../source/user_guide/feature_guide/structured_output.md:13
+msgid "Usage in vllm-ascend"
+msgstr "在 vllm-ascend 中的使用"

-#: ../../user_guide/feature_guide/structured_output.md:15
+#: ../../source/user_guide/feature_guide/structured_output.md:15
 msgid ""
-"Currently, vllm-ascend supports **xgrammar** and **guidance** backend for "
-"structured output with vllm v1 engine."
-msgstr "目前，vllm-ascend 支持 vllm v1 引擎的结构化输出，后端包括 **xgrammar** 和 **guidance**。"
+"Currently, the usage of structured output feature in vllm-ascend is "
+"totally the same as that in vllm."
+msgstr "目前，vllm-ascend 中结构化输出功能的使用方式与 vllm 中完全相同。"

-#: ../../user_guide/feature_guide/structured_output.md:17
+#: ../../source/user_guide/feature_guide/structured_output.md:17
 msgid ""
-"XGrammar introduces a new technique that batch constrained decoding via "
-"pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, "
-"and each FSM represents a context-free grammar (CFG).” One significant "
-"advantage of PDA is its recursive nature, allowing us to execute multiple "
-"state transitions. They also include additional optimisation (for those who "
-"are interested) to reduce grammar compilation overhead. Besides, you can "
-"also find more details about guidance by yourself."
-msgstr ""
-"XGrammar 引入了一种通过下推自动机（PDA）进行批量约束解码的新技术。你可以把 PDA 理解为“有限状态机（FSM）的集合，每个 FSM "
-"代表一个上下文无关文法（CFG）。” PDA 的一个重要优点是其递归特性，使我们能够执行多次状态转移。此外，PDA "
-"还包含了额外的优化（供感兴趣的用户参考），以减少语法编译的开销。除此之外，你还可以自己找到更多关于指导的信息。"
-
-#: ../../user_guide/feature_guide/structured_output.md:19
-msgid "How to Use Structured Output?"
-msgstr "如何使用结构化输出？"
-
-#: ../../user_guide/feature_guide/structured_output.md:21
-msgid "Online Inference"
-msgstr "在线推理"
-
-#: ../../user_guide/feature_guide/structured_output.md:23
-msgid ""
-"You can also generate structured outputs using the OpenAI's Completions and "
-"Chat API. The following parameters are supported, which must be added as "
-"extra parameters:"
-msgstr "你也可以使用 OpenAI 的 Completions 和 Chat API 生成结构化输出。支持以下参数，这些参数必须作为额外参数添加："
-
-#: ../../user_guide/feature_guide/structured_output.md:25
-msgid "`guided_choice`: the output will be exactly one of the choices."
-msgstr "`guided_choice`：输出将会是其中一个选项。"
-
-#: ../../user_guide/feature_guide/structured_output.md:26
-msgid "`guided_regex`: the output will follow the regex pattern."
-msgstr "`guided_regex`：输出将遵循正则表达式模式。"
-
-#: ../../user_guide/feature_guide/structured_output.md:27
-msgid "`guided_json`: the output will follow the JSON schema."
-msgstr "`guided_json`：输出将遵循 JSON 架构。"
-
-#: ../../user_guide/feature_guide/structured_output.md:28
-msgid "`guided_grammar`: the output will follow the context free grammar."
-msgstr "`guided_grammar`：输出将遵循上下文无关文法。"
-
-#: ../../user_guide/feature_guide/structured_output.md:30
-msgid ""
-"Structured outputs are supported by default in the OpenAI-Compatible Server."
-" You can choose to specify the backend to use by setting the `--guided-"
-"decoding-backend` flag to vllm serve. The default backend is `auto`, which "
-"will try to choose an appropriate backend based on the details of the "
-"request. You may also choose a specific backend, along with some options."
-msgstr ""
-"OpenAI 兼容服务器默认支持结构化输出。你可以通过设置 `--guided-decoding-backend` 标志为 vllm serve "
-"来指定要使用的后端。默认后端为 `auto`，它会根据请求的详细信息尝试选择合适的后端。你也可以选择特定的后端，并设置一些选项。"
-
-#: ../../user_guide/feature_guide/structured_output.md:32
-msgid ""
-"Now let´s see an example for each of the cases, starting with the "
-"guided_choice, as it´s the easiest one:"
-msgstr "现在让我们来看每种情况的示例，首先是 guided_choice，因为它是最简单的："
-
-#: ../../user_guide/feature_guide/structured_output.md:51
-msgid ""
-"The next example shows how to use the guided_regex. The idea is to generate "
-"an email address, given a simple regex template:"
-msgstr "下一个例子展示了如何使用 guided_regex。其思路是基于一个简单的正则表达式模板生成一个电子邮件地址："
-
-#: ../../user_guide/feature_guide/structured_output.md:67
-msgid ""
-"One of the most relevant features in structured text generation is the "
-"option to generate a valid JSON with pre-defined fields and formats. For "
-"this we can use the guided_json parameter in two different ways:"
-msgstr ""
-"在结构化文本生成中，最相关的特性之一是能够生成具有预定义字段和格式的有效 JSON。为此，我们可以通过两种不同的方式使用 guided_json 参数："
-
-#: ../../user_guide/feature_guide/structured_output.md:69
-msgid "Using a JSON Schema."
-msgstr "使用 JSON 架构。"
-
-#: ../../user_guide/feature_guide/structured_output.md:70
-msgid "Defining a Pydantic model and then extracting the JSON Schema from it."
-msgstr "定义一个 Pydantic 模型，然后从中提取 JSON Schema。"
-
-#: ../../user_guide/feature_guide/structured_output.md:72
-msgid ""
-"The next example shows how to use the guided_json parameter with a Pydantic "
-"model:"
-msgstr "下一个示例展示了如何将 guided_json 参数与 Pydantic 模型一起使用："
-
-#: ../../user_guide/feature_guide/structured_output.md:104
-msgid ""
-"Finally we have the guided_grammar option, which is probably the most "
-"difficult to use, but it´s really powerful. It allows us to define complete "
-"languages like SQL queries. It works by using a context free EBNF grammar. "
-"As an example, we can use to define a specific format of simplified SQL "
-"queries:"
-msgstr ""
-"最后，我们有 guided_grammar 选项，这可能是最难使用的，但它非常强大。它允许我们定义完整的语言，比如 SQL 查询。它通过使用上下文无关的"
-" EBNF 语法来实现。例如，我们可以用它来定义一种简化 SQL 查询的特定格式："
-
-#: ../../user_guide/feature_guide/structured_output.md:134
-msgid ""
-"Find more examples [here](https://github.com/vllm-"
-"project/vllm/blob/main/examples/offline_inference/structured_outputs.py)."
-msgstr ""
-"在[这里](https://github.com/vllm-"
-"project/vllm/blob/main/examples/offline_inference/structured_outputs.py)可以找到更多示例。"
-
-#: ../../user_guide/feature_guide/structured_output.md:136
-msgid "Offline Inference"
-msgstr "离线推理"
-
-#: ../../user_guide/feature_guide/structured_output.md:138
-msgid ""
-"To use Structured Output, we'll need to configure the guided decoding using "
-"the class `GuidedDecodingParams` inside `SamplingParams`. The main available"
-" options inside `GuidedDecodingParams` are:"
-msgstr ""
-"要使用结构化输出，我们需要在 `SamplingParams` 内通过 `GuidedDecodingParams` "
-"类配置引导解码。`GuidedDecodingParams` 中主要可用的选项有："
-
-#: ../../user_guide/feature_guide/structured_output.md:140
-msgid "json"
-msgstr "json"
-
-#: ../../user_guide/feature_guide/structured_output.md:141
-msgid "regex"
-msgstr "正则表达式"
-
-#: ../../user_guide/feature_guide/structured_output.md:142
-msgid "choice"
-msgstr "选择"
-
-#: ../../user_guide/feature_guide/structured_output.md:143
-msgid "grammar"
-msgstr "语法"
-
-#: ../../user_guide/feature_guide/structured_output.md:145
-msgid "One example for the usage of the choice parameter is shown below:"
-msgstr "choice 参数用法的一个示例如下："
-
-#: ../../user_guide/feature_guide/structured_output.md:163
-msgid ""
-"Find more examples of other usages [here](https://github.com/vllm-"
-"project/vllm/blob/main/examples/offline_inference/structured_outputs.py)."
-msgstr ""
-"查看更多其他用法的示例 [在这里](https://github.com/vllm-"
-"project/vllm/blob/main/examples/offline_inference/structured_outputs.py)。"
+"Find more examples and explanations about these usages in [vLLM official "
+"document](https://docs.vllm.ai/en/stable/features/structured_outputs/)."
+msgstr "更多关于这些用法的示例和解释，请参阅 [vLLM 官方文档](https://docs.vllm.ai/en/stable/features/structured_outputs/)。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po
@@ -0,0 +1,219 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:1
+msgid "UCM-Enhanced Prefix Caching Deployment Guide"
+msgstr "UCM增强前缀缓存部署指南"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:3
+msgid "Overview"
+msgstr "概述"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:5
+msgid ""
+"Unified Cache Management (UCM) provides an external KV-cache storage "
+"layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike "
+"KV Pooling, which expands prefix-cache capacity only by aggregating "
+"device memory and therefore remains limited by HBM/DRAM size and lacks "
+"persistence, UCM decouples compute from storage and adopts a tiered "
+"design. Each node uses local DRAM as a fast cache, while a shared "
+"backend—such as 3FS or enterprise-grade storage—serves as the persistent "
+"KV store. This approach removes the capacity ceiling imposed by device "
+"memory, enables durable and reliable prefix caching, and allows cache "
+"capacity to scale with the storage system rather than with compute "
+"resources."
+msgstr ""
+"统一缓存管理（UCM）为vLLM/vLLM-Ascend中的前缀缓存场景提供了一个外部的KV缓存存储层。与仅通过聚合设备内存来扩展前缀缓存容量、因此仍受限于HBM/DRAM大小且缺乏持久性的KV池化不同，UCM将计算与存储解耦，并采用分层设计。每个节点使用本地DRAM作为快速缓存，而共享后端（如3FS或企业级存储）则作为持久化的KV存储。这种方法消除了设备内存带来的容量上限，实现了持久可靠的前缀缓存，并使缓存容量能够随存储系统而非计算资源扩展。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:7
+msgid "Prerequisites"
+msgstr "先决条件"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:9
+msgid "OS: Linux"
+msgstr "操作系统：Linux"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:10
+msgid "Hardware with Ascend NPUs. It's usually the Atlas 800 A2 series."
+msgstr "配备昇腾NPU的硬件。通常是Atlas 800 A2系列。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:11
+msgid "**vLLM: main branch**"
+msgstr "**vLLM：main分支**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:12
+msgid "**vLLM Ascend: main branch**"
+msgstr "**vLLM Ascend：main分支**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:14
+msgid "UCM Installation"
+msgstr "UCM安装"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:16
+msgid ""
+"**Please refer to the [official UCM installation guide for Ascend "
+"NPU](https://ucm.readthedocs.io/en/latest/getting-"
+"started/quickstart_vllm_ascend.html)**"
+msgstr ""
+"**请参考[昇腾NPU的官方UCM安装指南](https://ucm.readthedocs.io/en/latest/getting-started/quickstart_vllm_ascend.html)**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:18
+msgid "Configure UCM for Prefix Caching"
+msgstr "为前缀缓存配置UCM"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:20
+msgid ""
+"Modify the UCM configuration file to specify which UCM connector to use "
+"and where KV blocks should be stored. You may directly edit the example "
+"file at:"
+msgstr "修改UCM配置文件以指定使用哪个UCM连接器以及KV块应存储在何处。您可以直接编辑位于以下路径的示例文件："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:23
+msgid "`unified-cache-management/examples/ucm_config_example.yaml`"
+msgstr "`unified-cache-management/examples/ucm_config_example.yaml`"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:25
+msgid ""
+"**For updated configuration options, please refer to the [official UCM "
+"documentation for prefix-caching](https://ucm.readthedocs.io/en/latest"
+"/user-guide/prefix-cache/nfs_store.html)**"
+msgstr ""
+"**有关最新的配置选项，请参考[前缀缓存的官方UCM文档](https://ucm.readthedocs.io/en/latest/user-guide/prefix-cache/nfs_store.html)**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:27
+msgid "A minimal configuration looks like this:"
+msgstr "一个最小配置示例如下："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:39
+msgid "Explanation:"
+msgstr "说明："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:41
+msgid ""
+"ucm_connector_name: \"UcmNfsStore\": Specifies `UcmNfsStore` as the UCM "
+"connector."
+msgstr "ucm_connector_name: \"UcmNfsStore\"：指定`UcmNfsStore`作为UCM连接器。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:44
+msgid ""
+"storage_backends: Specify the directory used for storing KV blocks. It "
+"can be a local directory or an NFS-mounted path. UCM will store KV blocks"
+" here.  **⚠️ Make sure to replace `\"/mnt/test\"` with your actual "
+"storage directory.**"
+msgstr ""
+"storage_backends：指定用于存储KV块的目录。它可以是本地目录或NFS挂载路径。UCM将在此处存储KV块。**⚠️ 请确保将`\"/mnt/test\"`替换为您的实际存储目录。**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:48
+msgid "use_direct: Whether to enable direct I/O (optional). Default is `false`."
+msgstr "use_direct：是否启用直接I/O（可选）。默认为`false`。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:51
+msgid ""
+"load_only_first_rank: Controls whether only rank 0 loads KV cache and "
+"broadcasts it to other ranks.   This feature is currently not supported "
+"on Ascend, so it must be set to `false` (all ranks load/dump "
+"independently)."
+msgstr ""
+"load_only_first_rank：控制是否仅rank 0加载KV缓存并将其广播到其他rank。此功能目前在昇腾上不受支持，因此必须设置为`false`（所有rank独立加载/转储）。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:55
+msgid "Launching Inference"
+msgstr "启动推理"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:57
+msgid ""
+"In this guide, we describe **online inference** using vLLM with the UCM "
+"connector, deployed as an OpenAI-compatible server. For best performance "
+"with UCM, it is recommended to set `block_size` to 128."
+msgstr "本指南介绍如何使用带有UCM连接器的vLLM进行**在线推理**，并将其部署为OpenAI兼容的服务器。为了获得UCM的最佳性能，建议将`block_size`设置为128。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:59
+msgid "To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:"
+msgstr "要使用Qwen/Qwen2.5-14B-Instruct模型启动vLLM服务器，请运行："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:79
+msgid ""
+"**⚠️ Make sure to replace `\"/vllm-workspace/unified-cache-"
+"management/examples/ucm_config_example.yaml\"` with your actual config "
+"file path.**"
+msgstr "**⚠️ 请确保将`\"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml\"`替换为您的实际配置文件路径。**"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:81
+msgid "If you see the log below:"
+msgstr "如果您看到以下日志："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:89
+msgid ""
+"Congratulations, you have successfully started the vLLM server with UCM "
+"connector!"
+msgstr "恭喜，您已成功启动带有UCM连接器的vLLM服务器！"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:91
+msgid "Evaluating UCM Prefix Caching Performance"
+msgstr "评估UCM前缀缓存性能"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:93
+msgid ""
+"After launching the vLLM server with `UCMConnector` enabled, the easiest "
+"way to observe the prefix caching effect is to run the built-in `vllm "
+"bench` CLI. Executing the following command **twice** in a separate "
+"terminal shows the improvement clearly."
+msgstr "在启用`UCMConnector`的情况下启动vLLM服务器后，观察前缀缓存效果的最简单方法是运行内置的`vllm bench` CLI。在单独的终端中**两次**执行以下命令可以清晰地展示改进效果。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:112
+msgid "After the first execution"
+msgstr "第一次执行后"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:114
+msgid "The `vllm bench` terminal prints the benchmark result:"
+msgstr "`vllm bench`终端打印基准测试结果："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:121
+msgid "Inspecting the vLLM server logs reveals entries like:"
+msgstr "检查vLLM服务器日志会发现类似条目："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:127
+msgid ""
+"This indicates that for the first inference request, UCM did not hit any "
+"cached KV blocks. As a result, the full 16K-token prefill must be "
+"computed, leading to a relatively large TTFT."
+msgstr "这表明对于第一个推理请求，UCM未命中任何缓存的KV块。因此，必须计算完整的16K令牌预填充，导致相对较大的TTFT。"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:129
+msgid "After the second execution"
+msgstr "第二次执行后"
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:131
+msgid "Running the same benchmark again produces:"
+msgstr "再次运行相同的基准测试会产生："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:138
+msgid "The vLLM server logs now contain similar entries:"
+msgstr "vLLM服务器日志现在包含类似条目："
+
+#: ../../source/user_guide/feature_guide/ucm_deployment.md:144
+msgid ""
+"This indicates that during the second request, UCM successfully retrieved"
+" all 125 cached KV blocks from the storage backend. Leveraging the fully "
+"cached prefix significantly reduces the initial latency observed by the "
+"model, yielding an approximate **8× improvement in TTFT** compared to the"
+" initial run."
+msgstr "这表明在第二次请求期间，UCM成功从存储后端检索了全部125个缓存的KV块。利用完全缓存的前缀显著减少了模型观察到的初始延迟，与首次运行相比，TTFT实现了约**8倍的提升**。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po
@@ -0,0 +1,171 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2025, vllm-ascend team
+# This file is distributed under the same license as the vllm-ascend
+# package.
+# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: vllm-ascend \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
+"Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.18.0\n"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:1
+msgid "Weight Prefetch Guide"
+msgstr "权重预取指南"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:3
+msgid ""
+"Weight prefetching optimizes memory usage by preloading weights into the "
+"cache before they are needed, minimizing delays caused by memory access "
+"during model execution. Linear layers sometimes exhibit relatively high "
+"MTE utilization. To address this, we create a separate pipeline "
+"specifically for weight prefetching, which runs in parallel with the "
+"original vector computation pipeline, such as quantize, MoE gating top_k,"
+" RMSNorm and SwiGlu. This approach allows the weights to be preloaded to "
+"L2 cache ahead of time, reducing MTE utilization during the linear layer "
+"computations and indirectly improving Cube computation efficiency by "
+"minimizing resource contention and optimizing data flow."
+msgstr ""
+"权重预取通过在需要之前将权重预加载到缓存中来优化内存使用，从而最小化模型执行期间因内存访问造成的延迟。线性层有时表现出相对较高的MTE利用率。为了解决这个问题，我们创建了一个专门用于权重预取的独立流水线，该流水线与原始向量计算流水线（如量化、MoE门控top_k、RMSNorm和SwiGlu）并行运行。这种方法允许权重提前预加载到L2缓存中，减少线性层计算期间的MTE利用率，并通过最小化资源争用和优化数据流间接提高Cube计算效率。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:5
+msgid ""
+"Since we use vector computations to hide the weight prefetching pipeline,"
+" this has an effect on computation. If you prioritize low latency over "
+"high throughput, it is best not to enable prefetching."
+msgstr ""
+"由于我们使用向量计算来隐藏权重预取流水线，这会对计算产生影响。如果您优先考虑低延迟而非高吞吐量，最好不要启用预取。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:7
+msgid "Quick Start"
+msgstr "快速开始"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:9
+#, python-brace-format
+msgid ""
+"With `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
+"true}}'` to open weight prefetch."
+msgstr ""
+"使用 `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
+"true}}'` 来开启权重预取。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:11
+msgid "Fine-tune Prefetch Ratio"
+msgstr "微调预取比例"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:13
+msgid ""
+"Since weight prefetch use vector computations to hide the weight "
+"prefetching pipeline, the setting of the prefetch size is crucial. If the"
+" size is too small, the optimization benefits will not be fully realized,"
+" while a larger size may lead to resource contention, resulting in "
+"performance degradation. To accommodate different scenarios, we have "
+"added `prefetch_ratio` to allow for flexible size configuration based on "
+"the specific workload, details as follows:"
+msgstr ""
+"由于权重预取使用向量计算来隐藏权重预取流水线，预取大小的设置至关重要。如果大小太小，则无法充分发挥优化优势；而较大的大小可能导致资源争用，从而导致性能下降。为了适应不同的场景，我们添加了`prefetch_ratio`，允许根据具体工作负载灵活配置大小，详情如下："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:15
+msgid ""
+"With `prefetch_ratio` in `\"weight_prefetch_config\"` to custom the "
+"weight prefetch ratio for specific linear layers."
+msgstr ""
+"使用`\"weight_prefetch_config\"`中的`prefetch_ratio`来为特定的线性层自定义权重预取比例。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:17
+msgid ""
+"The “attn” and “moe” configuration options are used for MoE model, "
+"details as follows:"
+msgstr ""
+"“attn”和“moe”配置选项用于MoE模型，详情如下："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:19
+#, python-brace-format
+msgid "`\"attn\": { \"qkv\": 1.0,  \"o\": 1.0},  \"moe\": {\"gate_up\": 0.8}`"
+msgstr "`\"attn\": { \"qkv\": 1.0,  \"o\": 1.0},  \"moe\": {\"gate_up\": 0.8}`"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:21
+msgid ""
+"The “mlp” configuration option is used to optimize the performance of the"
+" Dense model, details as follows:"
+msgstr ""
+"“mlp”配置选项用于优化Dense模型的性能，详情如下："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:23
+#, python-brace-format
+msgid "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
+msgstr "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:25
+msgid ""
+"Above value are the default config, the default value has a good "
+"performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, for "
+"Qwen3-32B-W8A8 when `--max-num-seqs` is 72."
+msgstr ""
+"以上值为默认配置。当`--max-num-seqs`为144时，该默认配置在Qwen3-235B-A22B-W8A8上表现良好；当`--max-num-seqs`为72时，在Qwen3-32B-W8A8上表现良好。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:27
+msgid ""
+"However, this may not be the optimal configuration for your scenario. For"
+" higher concurrency, you can try increasing the prefetch size. For lower "
+"concurrency, prefetching may not offer any advantages, so you can "
+"decrease the size or disable prefetching. Determine if the prefetch size "
+"is appropriate by collecting profiling data. Specifically, check if the "
+"time required for the prefetch operation (e.g., MLP Down Proj weight "
+"prefetching) overlaps with the time required for parallel vector "
+"computation operators (e.g., SwiGlu computation), and whether the "
+"prefetch operation is no later than the completion time of the vector "
+"computation operator. In the profiling timeline, a prefetch operation "
+"appears as a CMO operation on a single stream; this CMO operation is the "
+"prefetch operation."
+msgstr ""
+"然而，这可能不是您场景下的最优配置。对于更高的并发度，可以尝试增加预取大小。对于较低的并发度，预取可能不会带来任何优势，因此可以减少大小或禁用预取。通过收集性能分析数据来确定预取大小是否合适。具体来说，检查预取操作（例如，MLP Down Proj权重预取）所需的时间是否与并行向量计算算子（例如，SwiGlu计算）所需的时间重叠，以及预取操作是否不晚于向量计算算子的完成时间。在性能分析时间线中，预取操作显示为单个流上的CMO操作；此CMO操作即为预取操作。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:29
+msgid "Notes:"
+msgstr "注意："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:31
+msgid ""
+"Weight prefetch of MLP `down` project prefetch depends on sequence "
+"parallel, if you want to open for mlp `down` please also enable sequence "
+"parallel."
+msgstr ""
+"MLP `down`投影的权重预取依赖于序列并行，如果您想为mlp `down`开启预取，请同时启用序列并行。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:32
+msgid ""
+"Due to the current size of the L2 cache, the maximum prefetch cannot "
+"exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 *"
+" 1024` bytes, the backend will only prefetch 18MB."
+msgstr ""
+"由于当前L2缓存的大小，最大预取量不能超过18MB。如果`prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024`字节，后端将只预取18MB。"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:34
+msgid "Example"
+msgstr "示例"
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:36
+msgid "For MoE model:"
+msgstr "对于MoE模型："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:56
+msgid "For dense model:"
+msgstr "对于Dense模型："
+
+#: ../../source/user_guide/feature_guide/weight_prefetch.md:58
+msgid ""
+"Following is the default configuration that can get a good performance "
+"for `--max-num-seqs` is 72 for Qwen3-32B-W8A8"
+msgstr ""
+"以下是默认配置，当`--max-num-seqs`为72时，该配置可为Qwen3-32B-W8A8带来良好性能"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/index.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/index.po
@@ -3,28 +3,27 @@
 # This file is distributed under the same license as the PROJECT project.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
 "Project-Id-Version: PROJECT VERSION\n"
 "Report-Msgid-Bugs-To: EMAIL@ADDRESS\n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language-Team: LANGUAGE <LL@li.org>\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/support_matrix/index.md:5
+#: ../../source/user_guide/support_matrix/index.md:5
 msgid "Support Matrix"
 msgstr "支持矩阵"

-#: ../../user_guide/support_matrix/index.md:1
-msgid "Features and models"
+#: ../../source/user_guide/support_matrix/index.md:1
+msgid "Features and Models"
 msgstr "特性与模型"

-#: ../../user_guide/support_matrix/index.md:3
-msgid "This section provides a detailed supported matrix by vLLM Ascend."
-msgstr "本节提供了 vLLM Ascend 的详细支持矩阵。"
+#: ../../source/user_guide/support_matrix/index.md:3
+msgid "This section provides a detailed matrix supported by vLLM Ascend."
+msgstr "本节提供了 vLLM Ascend 的详细支持矩阵。"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_features.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_features.po
@@ -4,261 +4,297 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/support_matrix/supported_features.md:1
-msgid "Feature Support"
-msgstr "功能支持"
+#: ../../source/user_guide/support_matrix/supported_features.md:1
+msgid "Supported Features"
+msgstr "支持的功能"

-#: ../../user_guide/support_matrix/supported_features.md:3
+#: ../../source/user_guide/support_matrix/supported_features.md:3
 msgid ""
-"The feature support principle of vLLM Ascend is: **aligned with the vLLM**. "
-"We are also actively collaborating with the community to accelerate support."
-msgstr "vLLM Ascend 的特性支持原则是：**与 vLLM 保持一致**。我们也在积极与社区合作，加快支持进度。"
+"The feature support principle of vLLM Ascend is: **aligned with vLLM**. "
+"We are also actively collaborating with the community to accelerate "
+"support."
+msgstr "vLLM Ascend 的功能支持原则是：**与 vLLM 保持一致**。我们也在积极与社区合作，以加快支持进度。"

-#: ../../user_guide/support_matrix/supported_features.md:5
+#: ../../source/user_guide/support_matrix/supported_features.md:5
+msgid "Functional call: <https://docs.vllm.ai/en/latest/features/tool_calling/>"
+msgstr "函数调用：<https://docs.vllm.ai/en/latest/features/tool_calling/>"
+
+#: ../../source/user_guide/support_matrix/supported_features.md:7
 msgid ""
-"You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below "
-"is the feature support status of vLLM Ascend:"
-msgstr "你可以查看 [vLLM V1 引擎的支持状态][v1_user_guide]。下面是 vLLM Ascend 的功能支持情况："
+"You can check the [support status of vLLM V1 Engine][v1_user_guide]. "
+"Below is the feature support status of vLLM Ascend:"
+msgstr "您可以查看 [vLLM V1 引擎的支持状态][v1_user_guide]。以下是 vLLM Ascend 的功能支持状态："

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "Feature"
-msgstr "特性"
+msgstr "功能"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "vLLM V0 Engine"
-msgstr "vLLM V0 引擎"
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Status"
+msgstr "状态"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "vLLM V1 Engine"
-msgstr "vLLM V1 引擎"
-
-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "Next Step"
-msgstr "下一步"
+msgstr "后续步骤"

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "Chunked Prefill"
 msgstr "分块预填充"

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "🟢 Functional"
-msgstr "🟢 功能性"
+msgstr "🟢 功能完备"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Functional, see detail note: [Chunked Prefill][cp]"
-msgstr "功能性，详见说明：[分块预填充][cp]"
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [Chunked Prefill][cp]"
+msgstr "功能完备，详见说明：[分块预填充][cp]"

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "Automatic Prefix Caching"
 msgstr "自动前缀缓存"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Functional, see detail note: [vllm-ascend#732][apc]"
-msgstr "可用，请参见详细说明：[vllm-ascend#732][apc]"
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [vllm-ascend#732][apc]"
+msgstr "功能完备，详见说明：[vllm-ascend#732][apc]"

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "LoRA"
 msgstr "LoRA"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "[vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora]"
-msgstr "[vllm-ascend#396][multilora]，[vllm-ascend#893][v1 multilora]"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Prompt adapter"
-msgstr "提示适配器"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "🔴 No plan"
-msgstr "🔴 无计划"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "This feature has been deprecated by vllm."
-msgstr "此功能已被 vllm 弃用。"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Speculative decoding"
-msgstr "猜测式解码"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Basic support"
-msgstr "基础支持"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Pooling"
-msgstr "池化"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "🟡 Planned"
-msgstr "🟡 计划中"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "CI needed and adapting more models; V1 support rely on vLLM support."
-msgstr "需要持续集成（CI）并适配更多模型；V1 的支持依赖于 vLLM 的支持。"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Enc-dec"
-msgstr "Enc-dec（编码-解码）"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "🔴 NO plan"
-msgstr "🔴 没有计划"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Plan in 2025.06.30"
-msgstr "2025.06.30 的计划"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Multi Modality"
-msgstr "多模态"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "[Tutorial][multimodal], optimizing and adapting more models"
-msgstr "[教程][multimodal]，优化和适配更多模型"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "LogProbs"
-msgstr "LogProbs"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "CI needed"
-msgstr "需要持续集成（CI）"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Prompt logProbs"
-msgstr "提示 logProbs"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Async output"
-msgstr "异步输出"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Multi step scheduler"
-msgstr "多步调度器"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "🔴 Deprecated"
-msgstr "🔴 已弃用"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "[vllm#8779][v1_rfc], replaced by [vLLM V1 Scheduler][v1_scheduler]"
-msgstr "[vllm#8779][v1_rfc]，已被 [vLLM V1 调度器][v1_scheduler] 替代"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Best of"
-msgstr "精选"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "[vllm#13361][best_of], CI needed"
-msgstr "[vllm#13361][best_of]，需要持续集成（CI）"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Beam search"
-msgstr "束搜索"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Guided Decoding"
-msgstr "引导解码"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "[vllm-ascend#177][guided_decoding]"
-msgstr "[vllm-ascend#177][guided_decoding]"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Tensor Parallel"
-msgstr "张量并行"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Pipeline Parallel"
-msgstr "流水线并行"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Expert Parallel"
-msgstr "专家并行"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "CI needed; No plan on V0 support"
-msgstr "需要持续集成；没有支持V0的计划"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Data Parallel"
-msgstr "数据并行"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "CI needed;  No plan on V0 support"
-msgstr "需要 CI；暂无 V0 支持计划"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Prefill Decode Disaggregation"
-msgstr "预填充 解码 拆分"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "1P1D available, working on xPyD and V1 support."
-msgstr "1P1D 已可用，正在开发 xPyD 和 V1 支持。"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Quantization"
-msgstr "量化"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "W8A8 available, CI needed; working on more quantization method support"
-msgstr "W8A8 已可用，需要持续集成（CI）；正在开发对更多量化方法的支持。"
-
-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Graph Mode"
-msgstr "图模式"
-
-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "🔵 Experimental"
 msgstr "🔵 实验性"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "Experimental, see detail note: [vllm-ascend#767][graph_mode]"
-msgstr "实验性功能，详见说明：[vllm-ascend#767][graph_mode]"
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [LoRA][LoRA]"
+msgstr "功能完备，详见说明：[LoRA][LoRA]"

-#: ../../user_guide/support_matrix/supported_features.md
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Speculative decoding"
+msgstr "推测解码"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Basic support"
+msgstr "基础支持"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Pooling"
+msgstr "池化"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "CI needed to adapt to more models; V1 support relies on vLLM support."
+msgstr "需要 CI 以适配更多模型；V1 支持依赖于 vLLM 的支持。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Enc-dec"
+msgstr "编码器-解码器"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "🟡 Planned"
+msgstr "🟡 计划中"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "vLLM should support this feature first."
+msgstr "vLLM 需要首先支持此功能。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Multi Modality"
+msgstr "多模态"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "[Multi Modality][multimodal], optimizing and adapting more models"
+msgstr "[多模态][multimodal]，优化和适配更多模型"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "LogProbs"
+msgstr "LogProbs"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "CI needed"
+msgstr "需要 CI"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Prompt logProbs"
+msgstr "提示词 LogProbs"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Async output"
+msgstr "异步输出"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Beam search"
+msgstr "束搜索"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Guided Decoding"
+msgstr "引导解码"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "[vllm-ascend#177][guided_decoding]"
+msgstr "[vllm-ascend#177][guided_decoding]"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Tensor Parallel"
+msgstr "张量并行"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Make TP >4 work with graph mode."
+msgstr "使 TP >4 能在图模式下工作。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Pipeline Parallel"
+msgstr "流水线并行"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Write official guide and tutorial."
+msgstr "编写官方指南和教程。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Expert Parallel"
+msgstr "专家并行"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Support dynamic EPLB."
+msgstr "支持动态 EPLB。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Data Parallel"
+msgstr "数据并行"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Data Parallel support for Qwen3 MoE."
+msgstr "为 Qwen3 MoE 提供数据并行支持。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Prefill Decode Disaggregation"
+msgstr "预填充解码分离"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, xPyD is supported."
+msgstr "功能完备，支持 xPyD。"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Quantization"
+msgstr "量化"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "W8A8 available; working on more quantization method support (W4A8, etc)"
+msgstr "W8A8 已可用；正在开发对更多量化方法（如 W4A8 等）的支持"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Graph Mode"
+msgstr "图模式"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [Graph Mode][graph_mode]"
+msgstr "功能完备，详见说明：[图模式][graph_mode]"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
 msgid "Sleep Mode"
-msgstr "睡眠模式"
+msgstr "休眠模式"

-#: ../../user_guide/support_matrix/supported_features.md
-msgid "level=1 available, CI needed, working on V1 support"
-msgstr "level=1 可用，需要CI，正在开发 V1 支持"
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [Sleep Mode][sleep_mode]"
+msgstr "功能完备，详见说明：[休眠模式][sleep_mode]"

-#: ../../user_guide/support_matrix/supported_features.md:33
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Context Parallel"
+msgstr "上下文并行"
+
+#: ../../source/user_guide/support_matrix/supported_features.md
+msgid "Functional, see detailed note: [Context Parallel][context_parallel]"
+msgstr "功能完备，详见说明：[上下文并行][context_parallel]"
+
+#: ../../source/user_guide/support_matrix/supported_features.md:33
 msgid "🟢 Functional: Fully operational, with ongoing optimizations."
-msgstr "🟢 功能性：完全可用，正在持续优化中。"
+msgstr "🟢 功能完备：完全可用，正在持续优化中。"

-#: ../../user_guide/support_matrix/supported_features.md:34
-msgid ""
-"🔵 Experimental: Experimental support, interfaces and functions may change."
-msgstr "🔵 实验性：实验性支持，接口和功能可能会发生变化。"
+#: ../../source/user_guide/support_matrix/supported_features.md:34
+msgid "🔵 Experimental: Experimental support, interfaces and functions may change."
+msgstr "🔵 实验性：实验性支持，接口和功能可能发生变化。"

-#: ../../user_guide/support_matrix/supported_features.md:35
+#: ../../source/user_guide/support_matrix/supported_features.md:35
 msgid "🚧 WIP: Under active development, will be supported soon."
-msgstr "🚧 WIP：正在积极开发中，很快将会支持。"
+msgstr "🚧 开发中：正在积极开发，即将支持。"

-#: ../../user_guide/support_matrix/supported_features.md:36
+#: ../../source/user_guide/support_matrix/supported_features.md:36
 msgid ""
 "🟡 Planned: Scheduled for future implementation (some may have open "
 "PRs/RFCs)."
-msgstr "🟡 计划中：已安排将来实现（其中一些可能已有开放的PR/RFC）。"
+msgstr "🟡 计划中：计划在未来实现（部分可能已有开放的 PR/RFC）。"

-#: ../../user_guide/support_matrix/supported_features.md:37
-msgid "🔴 NO plan / Deprecated: No plan for V0 or deprecated by vLLM v1."
-msgstr "🔴 没有计划 / 已弃用：V0 没有计划或已被 vLLM v1 弃用。"
+#: ../../source/user_guide/support_matrix/supported_features.md:37
+msgid "🔴 NO plan/Deprecated: No plan or deprecated by vLLM."
+msgstr "🔴 无计划/已弃用：暂无计划或已被 vLLM 弃用。"
+
+#~ msgid "Feature Support"
+#~ msgstr "功能支持"
+
+#~ msgid "vLLM V0 Engine"
+#~ msgstr "vLLM V0 引擎"
+
+#~ msgid "vLLM V1 Engine"
+#~ msgstr "vLLM V1 引擎"
+
+#~ msgid "[vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora]"
+#~ msgstr "[vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora]"
+
+#~ msgid "Prompt adapter"
+#~ msgstr "提示词适配器"
+
+#~ msgid "🔴 No plan"
+#~ msgstr "🔴 无计划"
+
+#~ msgid "This feature has been deprecated by vllm."
+#~ msgstr "此功能已被 vllm 弃用。"
+
+#~ msgid "🔴 NO plan"
+#~ msgstr "🔴 无计划"
+
+#~ msgid "Plan in 2025.06.30"
+#~ msgstr "计划于 2025.06.30"
+
+#~ msgid "Multi step scheduler"
+#~ msgstr "多步调度器"
+
+#~ msgid "🔴 Deprecated"
+#~ msgstr "🔴 已弃用"
+
+#~ msgid "[vllm#8779][v1_rfc], replaced by [vLLM V1 Scheduler][v1_scheduler]"
+#~ msgstr "[vllm#8779][v1_rfc]，已被 [vLLM V1 调度器][v1_scheduler] 取代"
+
+#~ msgid "Best of"
+#~ msgstr "最佳结果"
+
+#~ msgid "[vllm#13361][best_of], CI needed"
+#~ msgstr "[vllm#13361][best_of]，需要 CI"
+
+#~ msgid "CI needed; No plan on V0 support"
+#~ msgstr "需要 CI；暂无 V0 支持计划"
+
+#~ msgid "CI needed;  No plan on V0 support"
+#~ msgstr "需要 CI；暂无 V0 支持计划"
+
+#~ msgid "1P1D available, working on xPyD and V1 support."
+#~ msgstr "1P1D 已可用，正在开发 xPyD 和 V1 支持。"
+
+#~ msgid "Experimental, see detail note: [vllm-ascend#767][graph_mode]"
+#~ msgstr "实验性，详见说明：[vllm-ascend#767][graph_mode]"
+
+#~ msgid "level=1 available, CI needed, working on V1 support"
+#~ msgstr "level=1 已可用，需要 CI，正在开发 V1 支持"
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po
@@ -4,187 +4,620 @@
 # package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
 #
-#, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: vllm-ascend\n"
+"Project-Id-Version:  vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2025-07-18 09:01+0800\n"
+"POT-Creation-Date: 2026-04-14 09:08+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
-"Language-Team: zh_CN <LL@li.org>\n"
 "Language: zh_CN\n"
+"Language-Team: zh_CN <LL@li.org>\n"
+"Plural-Forms: nplurals=1; plural=0;\n"
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"Generated-By: Babel 2.17.0\n"
+"Generated-By: Babel 2.18.0\n"

-#: ../../user_guide/support_matrix/supported_models.md:1
-msgid "Model Support"
-msgstr "模型支持"
+#: ../../source/user_guide/support_matrix/supported_models.md:1
+msgid "Supported Models"
+msgstr "支持的模型"

-#: ../../user_guide/support_matrix/supported_models.md:3
-msgid "Text-only Language Models"
+#: ../../source/user_guide/support_matrix/supported_models.md:3
+msgid ""
+"Get the latest info here: <https://github.com/vllm-project/vllm-"
+"ascend/issues/1608>"
+msgstr "获取最新信息请访问：<https://github.com/vllm-project/vllm-ascend/issues/1608>"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:5
+msgid "**Legend Description**:"
+msgstr "**图例说明**："
+
+#: ../../source/user_guide/support_matrix/supported_models.md:7
+msgid "✅ = Supported model/feature"
+msgstr "✅ = 支持的模型/功能"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:8
+msgid "🔵 = Experimental supported model/feature"
+msgstr "🔵 = 实验性支持的模型/功能"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:9
+msgid "❌ = Not supported model/feature"
+msgstr "❌ = 不支持的模型/功能"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:10
+msgid "🟡 = Not tested or verified"
+msgstr "🟡 = 未测试或未验证"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:12
+msgid "Text-Only Language Models"
 msgstr "纯文本语言模型"

-#: ../../user_guide/support_matrix/supported_models.md:5
-#: ../../user_guide/support_matrix/supported_models.md:38
+#: ../../source/user_guide/support_matrix/supported_models.md:14
+#: ../../source/user_guide/support_matrix/supported_models.md:74
 msgid "Generative Models"
 msgstr "生成模型"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md:16
+#: ../../source/user_guide/support_matrix/supported_models.md:76
+msgid "Core Supported Models"
+msgstr "核心支持的模型"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Model"
 msgstr "模型"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Supported"
-msgstr "支持"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Support"
+msgstr "支持状态"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Note"
-msgstr "注释"
+msgstr "备注"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "DeepSeek v3"
-msgstr "DeepSeek v3"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "BF16"
+msgstr "BF16"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Supported Hardware"
+msgstr "支持的硬件"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "W8A8"
+msgstr "W8A8"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Chunked Prefill"
+msgstr "分块预填充"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Automatic Prefix Cache"
+msgstr "自动前缀缓存"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "LoRA"
+msgstr "LoRA"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Speculative Decoding"
+msgstr "推测解码"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Async Scheduling"
+msgstr "异步调度"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Tensor Parallel"
+msgstr "张量并行"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Pipeline Parallel"
+msgstr "流水线并行"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Expert Parallel"
+msgstr "专家并行"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Data Parallel"
+msgstr "数据并行"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Prefill-decode Disaggregation"
+msgstr "预填充-解码解耦"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Piecewise AclGraph"
+msgstr "分段 AclGraph"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Fullgraph AclGraph"
+msgstr "全图 AclGraph"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "max-model-len"
+msgstr "最大模型长度"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MLP Weight Prefetch"
+msgstr "MLP 权重预取"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Doc"
+msgstr "文档"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "DeepSeek V3/3.1"
+msgstr "DeepSeek V3/3.1"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "✅"
 msgstr "✅"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "A2/A3"
+msgstr "A2/A3"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "240k"
+msgstr "240k"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[DeepSeek-V3.1](../../tutorials/models/DeepSeek-V3.1.md)"
+msgstr "[DeepSeek-V3.1](../../tutorials/models/DeepSeek-V3.1.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "DeepSeek V3.2"
+msgstr "DeepSeek V3.2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "🔵"
+msgstr "🔵"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "160k"
+msgstr "160k"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[DeepSeek-V3.2](../../tutorials/models/DeepSeek-V3.2.md)"
+msgstr "[DeepSeek-V3.2](../../tutorials/models/DeepSeek-V3.2.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "DeepSeek R1"
 msgstr "DeepSeek R1"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "DeepSeek Distill (Qwen/LLama)"
-msgstr "DeepSeek 精炼（Qwen/LLama）"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "128k"
+msgstr "128k"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[DeepSeek-R1](../../tutorials/models/DeepSeek-R1.md)"
+msgstr "[DeepSeek-R1](../../tutorials/models/DeepSeek-R1.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Qwen3"
 msgstr "Qwen3"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-Dense](../../tutorials/models/Qwen3-Dense.md)"
+msgstr "[Qwen3-Dense](../../tutorials/models/Qwen3-Dense.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Coder"
+msgstr "Qwen3-Coder"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid ""
+"[Qwen3-Coder-30B-A3B tutorial](../../tutorials/models/Qwen3-Coder-30B-"
+"A3B.md)"
+msgstr "[Qwen3-Coder-30B-A3B 教程](../../tutorials/models/Qwen3-Coder-30B-A3B.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Qwen3-Moe"
 msgstr "Qwen3-Moe"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "256k"
+msgstr "256k"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-235B-A22B](../../tutorials/models/Qwen3-235B-A22B.md)"
+msgstr "[Qwen3-235B-A22B](../../tutorials/models/Qwen3-235B-A22B.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Next"
+msgstr "Qwen3-Next"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-Next](../../tutorials/models/Qwen3-Next.md)"
+msgstr "[Qwen3-Next](../../tutorials/models/Qwen3-Next.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Qwen2.5"
 msgstr "Qwen2.5"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "QwQ-32B"
-msgstr "QwQ-32B"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen2.5-7B](../../tutorials/models/Qwen2.5-7B.md)"
+msgstr "[Qwen2.5-7B](../../tutorials/models/Qwen2.5-7B.md)"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "LLama3.1/3.2"
-msgstr "LLama3.1/3.2"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "GLM-4.x"
+msgstr "GLM-4.x"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Internlm"
-msgstr "Internlm"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "198k"
+msgstr "198k"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Baichuan"
-msgstr "百川"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[GLM-4.x](../../tutorials/models/GLM4.x.md)"
+msgstr "[GLM-4.x](../../tutorials/models/GLM4.x.md)"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Phi-4-mini"
-msgstr "Phi-4-mini"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "GLM-5"
+msgstr "GLM-5"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "MiniCPM"
-msgstr "MiniCPM"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[GLM-5](../../tutorials/models/GLM5.md)"
+msgstr "[GLM-5](../../tutorials/models/GLM5.md)"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "MiniCPM3"
-msgstr "MiniCPM3"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Kimi-K2-Thinking"
+msgstr "Kimi-K2-Thinking"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "LLama4"
-msgstr "LLama4"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Kimi-K2-Thinking](../../tutorials/models/Kimi-K2-Thinking.md)"
+msgstr "[Kimi-K2-Thinking](../../tutorials/models/Kimi-K2-Thinking.md)"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Mistral"
-msgstr "Mistral"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniMax-M2.5"
+msgstr "MiniMax-M2.5"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Need test"
-msgstr "需要测试"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "DeepSeek v2.5"
-msgstr "DeepSeek v2.5"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Gemma-2"
-msgstr "Gemma-2"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Mllama"
-msgstr "Mllama"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Gemma-3"
-msgstr "Gemma-3"
-
-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "❌"
 msgstr "❌"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "[#496](https://github.com/vllm-project/vllm-ascend/issues/496)"
-msgstr "[#496](https://github.com/vllm-project/vllm-ascend/issues/496)"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "🟡"
+msgstr "🟡"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "ChatGLM"
-msgstr "ChatGLM"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "192k"
+msgstr "192k"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "[#554](https://github.com/vllm-project/vllm-ascend/issues/554)"
-msgstr "[#554](https://github.com/vllm-project/vllm-ascend/issues/554)"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[MiniMax-M2.5](../../tutorials/models/MiniMax-M2.md)"
+msgstr "[MiniMax-M2.5](../../tutorials/models/MiniMax-M2.md)"

-#: ../../user_guide/support_matrix/supported_models.md:29
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniMax-M2.7"
+msgstr "MiniMax-M2.7"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[MiniMax-M2.7](../../tutorials/models/MiniMax-M2.md)"
+msgstr "[MiniMax-M2.7](../../tutorials/models/MiniMax-M2.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:34
+#: ../../source/user_guide/support_matrix/supported_models.md:88
+msgid "Extended Compatible Models"
+msgstr "扩展兼容模型"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "DeepSeek Distill (Qwen/Llama)"
+msgstr "DeepSeek Distill (Qwen/Llama)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-based"
+msgstr "基于 Qwen3"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen2"
+msgstr "Qwen2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen2-based"
+msgstr "基于 Qwen2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "QwQ-32B"
+msgstr "QwQ-32B"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Llama2/3/3.1/3.2"
+msgstr "Llama2/3/3.1/3.2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Internlm"
+msgstr "Internlm"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[#1962](https://github.com/vllm-project/vllm-ascend/issues/1962)"
+msgstr "[#1962](https://github.com/vllm-project/vllm-ascend/issues/1962)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Baichuan"
+msgstr "Baichuan"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Baichuan2"
+msgstr "Baichuan2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Phi-4-mini"
+msgstr "Phi-4-mini"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniCPM"
+msgstr "MiniCPM"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniCPM3"
+msgstr "MiniCPM3"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Ernie4.5"
+msgstr "Ernie4.5"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Ernie4.5-Moe"
+msgstr "Ernie4.5-Moe"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Gemma-2"
+msgstr "Gemma-2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Gemma-3"
+msgstr "Gemma-3"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Phi-3/4"
+msgstr "Phi-3/4"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Mistral/Mistral-Instruct"
+msgstr "Mistral/Mistral-Instruct"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "DeepSeek V2.5"
+msgstr "DeepSeek V2.5"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Need test"
+msgstr "需要测试"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Mllama"
+msgstr "Mllama"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniMax-Text"
+msgstr "MiniMax-Text"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:60
 msgid "Pooling Models"
 msgstr "池化模型"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "XLM-RoBERTa-based"
-msgstr "基于XLM-RoBERTa"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Embedding"
+msgstr "Qwen3-Embedding"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3_embedding](../../tutorials/models/Qwen3_embedding.md)"
+msgstr "[Qwen3_embedding](../../tutorials/models/Qwen3_embedding.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-VL-Embedding"
+msgstr "Qwen3-VL-Embedding"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-VL-Embedding](../../tutorials/models/Qwen3-VL-Embedding.md)"
+msgstr "[Qwen3-VL-Embedding](../../tutorials/models/Qwen3-VL-Embedding.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Reranker"
+msgstr "Qwen3-Reranker"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3_reranker](../../tutorials/models/Qwen3_reranker.md)"
+msgstr "[Qwen3_reranker](../../tutorials/models/Qwen3_reranker.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-VL-Reranker"
+msgstr "Qwen3-VL-Reranker"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-VL-Reranker](../../tutorials/models/Qwen3-VL-Reranker.md)"
+msgstr "[Qwen3-VL-Reranker](../../tutorials/models/Qwen3-VL-Reranker.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Molmo"
 msgstr "Molmo"

-#: ../../user_guide/support_matrix/supported_models.md:36
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[1942](https://github.com/vllm-project/vllm-ascend/issues/1942)"
+msgstr "[1942](https://github.com/vllm-project/vllm-ascend/issues/1942)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "XLM-RoBERTa-based"
+msgstr "基于 XLM-RoBERTa"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Bert"
+msgstr "Bert"
+
+#: ../../source/user_guide/support_matrix/supported_models.md:72
 msgid "Multimodal Language Models"
 msgstr "多模态语言模型"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Qwen2-VL"
-msgstr "Qwen2-VL"
-
-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Qwen2.5-VL"
 msgstr "Qwen2.5-VL"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "30k"
+msgstr "30k"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md)"
+msgstr "[Qwen-VL-Dense](../../tutorials/models/Qwen-VL-Dense.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-VL"
+msgstr "Qwen3-VL"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-VL-MOE"
+msgstr "Qwen3-VL-MOE"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3-VL-MOE](../../tutorials/models/Qwen3-VL-235B-A22B-Instruct.md)"
+msgstr "[Qwen3-VL-MOE](../../tutorials/models/Qwen3-VL-235B-A22B-Instruct.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3.5-397B-A17B"
+msgstr "Qwen3.5-397B-A17B"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "1010000"
+msgstr "1010000"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3.5-397B-A17B](../../tutorials/models/Qwen3.5-397B-A17B.md)"
+msgstr "[Qwen3.5-397B-A17B](../../tutorials/models/Qwen3.5-397B-A17B.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3.5-27B"
+msgstr "Qwen3.5-27B"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen3.5-27B](../../tutorials/models/Qwen3.5-27B.md)"
+msgstr "[Qwen3.5-27B](../../tutorials/models/Qwen3.5-27B.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Omni-30B-A3B-Thinking"
+msgstr "Qwen3-Omni-30B-A3B-Thinking"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid ""
+"[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/models/Qwen3-Omni-30B-A3B-"
+"Thinking.md)"
+msgstr "[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen2.5-Omni"
+msgstr "Qwen2.5-Omni"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[Qwen2.5-Omni](../../tutorials/models/Qwen2.5-Omni.md)"
+msgstr "[Qwen2.5-Omni](../../tutorials/models/Qwen2.5-Omni.md)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen2-VL"
+msgstr "Qwen2-VL"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen3-Omni"
+msgstr "Qwen3-Omni"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "QVQ"
+msgstr "QVQ"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Qwen2-Audio"
+msgstr "Qwen2-Audio"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Aria"
+msgstr "Aria"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "LLaVA-Next"
 msgstr "LLaVA-Next"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "LLaVA-Next-Video"
 msgstr "LLaVA-Next-Video"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Phi-3-Vison/Phi-3.5-Vison"
-msgstr "Phi-3-Vison/Phi-3.5-Vison"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "MiniCPM-V"
+msgstr "MiniCPM-V"

-#: ../../user_guide/support_matrix/supported_models.md
-msgid "GLM-4v"
-msgstr "GLM-4v"
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Mistral3"
+msgstr "Mistral3"

-#: ../../user_guide/support_matrix/supported_models.md
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Phi-3-Vision/Phi-3.5-Vision"
+msgstr "Phi-3-Vision/Phi-3.5-Vision"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Gemma3"
+msgstr "Gemma3"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Llama3.2"
+msgstr "Llama3.2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "PaddleOCR-VL"
+msgstr "PaddleOCR-VL"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Llama4"
+msgstr "Llama4"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[1972](https://github.com/vllm-project/vllm-ascend/issues/1972)"
+msgstr "[1972](https://github.com/vllm-project/vllm-ascend/issues/1972)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Keye-VL-8B-Preview"
+msgstr "Keye-VL-8B-Preview"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[1963](https://github.com/vllm-project/vllm-ascend/issues/1963)"
+msgstr "[1963](https://github.com/vllm-project/vllm-ascend/issues/1963)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Florence-2"
+msgstr "Florence-2"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[2259](https://github.com/vllm-project/vllm-ascend/issues/2259)"
+msgstr "[2259](https://github.com/vllm-project/vllm-ascend/issues/2259)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "GLM-4V"
+msgstr "GLM-4V"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[2260](https://github.com/vllm-project/vllm-ascend/issues/2260)"
+msgstr "[2260](https://github.com/vllm-project/vllm-ascend/issues/2260)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "InternVL2.0/2.5/3.0<br>InternVideo2.5/Mono-InternVL"
+msgstr "InternVL2.0/2.5/3.0<br>InternVideo2.5/Mono-InternVL"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[2064](https://github.com/vllm-project/vllm-ascend/issues/2064)"
+msgstr "[2064](https://github.com/vllm-project/vllm-ascend/issues/2064)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "Whisper"
+msgstr "Whisper"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
+msgid "[2262](https://github.com/vllm-project/vllm-ascend/issues/2262)"
+msgstr "[2262](https://github.com/vllm-project/vllm-ascend/issues/2262)"
+
+#: ../../source/user_guide/support_matrix/supported_models.md
 msgid "Ultravox"
 msgstr "Ultravox"
+
+#~ msgid "Model Support"
+#~ msgstr "模型支持"
+
+#~ msgid "ChatGLM"
+#~ msgstr "ChatGLM"