[v0.18.0][Doc] Translated Doc files 2026-04-22 (#8565)
## Auto-Translation Summary

Translated **43** file(s):

- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/faqs.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/installation.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po</code>

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24767290887)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -119,6 +119,10 @@ msgstr "昇腾编译的配置选项"
msgid "`eplb_config`"
msgstr "`eplb_config`"
#: ../../source/user_guide/configuration/additional_config.md
msgid "Configuration options for eplb"
msgstr "eplb 的配置选项"
#: ../../source/user_guide/configuration/additional_config.md
msgid "`refresh`"
msgstr "`refresh`"
@@ -203,8 +207,12 @@ msgid "`recompute_scheduler_enable`"
msgstr "`recompute_scheduler_enable`"
#: ../../source/user_guide/configuration/additional_config.md
msgid "Whether to enable recompute scheduler."
msgstr "是否启用重计算调度器。"
msgid ""
"Whether to enable the recompute scheduler. **Only valid in PD-"
"disaggregated mode** (`kv_role` is `kv_producer` or `kv_consumer`). **Do "
"not enable in PD-mixed mode** (no `kv_transfer_config`, or `kv_role` is "
"`kv_both`); startup will fail with a clear error."
msgstr "是否启用重计算调度器。**仅在 PD 解耦模式下有效**(`kv_role` 为 `kv_producer` 或 `kv_consumer`)。**请勿在 PD 混合模式下启用**(无 `kv_transfer_config`,或 `kv_role` 为 `kv_both`);启动时将失败并显示明确的错误信息。"
#: ../../source/user_guide/configuration/additional_config.md
msgid "`enable_cpu_binding`"
@@ -347,7 +355,9 @@ msgstr "`prefetch_ratio`"
msgid ""
"`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, "
"\"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"
msgstr "`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, \"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"
msgstr ""
"`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, "
"\"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"
#: ../../source/user_guide/configuration/additional_config.md
msgid "Prefetch ratio of each weight."
@@ -519,3 +529,6 @@ msgstr "示例"
#: ../../source/user_guide/configuration/additional_config.md:99
msgid "An example of additional configuration is as follows:"
msgstr "以下是额外配置的一个示例:"
#~ msgid "Whether to enable recompute scheduler."
#~ msgstr "是否启用重计算调度器。"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -230,12 +230,12 @@ msgstr "实验结果"
msgid ""
"To evaluate the effectiveness of fine-grained TP in large-scale service "
"scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated "
"decode instances in an environment of 32 cards Ascend 910B*64G (A2), with"
" parallel configuration as DP32+EP32, and fine-grained TP size of 8; the "
"performance data is as follows."
"decode instances in an environment of 32 cards Ascend Atlas A2 inference "
"products*64G (A2), with parallel configuration as DP32+EP32, and fine-"
"grained TP size of 8; the performance data is as follows."
msgstr ""
"为评估细粒度 TP 在大规模服务场景中的有效性,我们使用模型 **DeepSeek-R1-W8A8**,在 32 卡 Ascend "
"910B*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"
"Atlas A2 推理产品*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
msgid "Module"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -29,16 +29,15 @@ msgid ""
"active development. Track progress and planned improvements at "
"<https://github.com/vllm-project/vllm-ascend/issues/5487>"
msgstr ""
"批次不变性功能目前处于测试阶段。部分功能仍在积极开发中。请通过 "
"<https://github.com/vllm-project/vllm-ascend/issues/5487> 跟踪进展和计划改进。"
"批次不变性功能目前处于测试阶段。部分功能仍在积极开发中。请通过 <https://github.com/vllm-project/vllm-"
"ascend/issues/5487> 跟踪进展和计划改进。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:8
msgid ""
"This document shows how to enable batch invariance in vLLM-Ascend. Batch "
"invariance ensures that the output of a model is deterministic and "
"independent of the batch size or the order of requests in a batch."
msgstr ""
"本文档介绍如何在 vLLM-Ascend 中启用批次不变性。批次不变性确保模型的输出是确定性的,且不依赖于批次大小或批次中请求的顺序。"
msgstr "本文档介绍如何在 vLLM-Ascend 中启用批次不变性。批次不变性确保模型的输出是确定性的,且不依赖于批次大小或批次中请求的顺序。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:10
msgid "Motivation"
@@ -53,8 +52,7 @@ msgid ""
"**Framework debugging**: Deterministic outputs make it easier to debug "
"issues in the inference framework, as the same input will always produce "
"the same output regardless of batching."
msgstr ""
"**框架调试**:确定性输出使得调试推理框架中的问题更加容易,因为无论批处理方式如何,相同的输入总是产生相同的输出。"
msgstr "**框架调试**:确定性输出使得调试推理框架中的问题更加容易,因为无论批处理方式如何,相同的输入总是产生相同的输出。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:15
msgid ""
@@ -81,11 +79,11 @@ msgstr "硬件要求"
#: ../../source/user_guide/feature_guide/batch_invariance.md:21
msgid ""
"Batch invariance currently requires Ascend 910B NPUs, because only the "
"910B supports batch invariance with HCCL communication for now. We will "
"support other NPUs in the future."
msgstr ""
"批次不变性目前需要 Ascend 910B NPU,因为目前只有 910B 支持通过 HCCL 通信实现批次不变性。我们未来将支持其他 NPU。"
"Batch invariance currently requires Ascend Atlas A2 inference products "
"NPUs, because only the Atlas A2 inference products supports batch "
"invariance with HCCL communication for now. We will support other NPUs in"
" the future."
msgstr "批次不变性目前需要 Ascend Atlas A2 推理产品 NPU,因为目前只有 Atlas A2 推理产品支持通过 HCCL 通信实现批次不变性。我们未来将支持其他 NPU。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:24
msgid "Software Requirements"
@@ -93,9 +91,10 @@ msgstr "软件要求"
#: ../../source/user_guide/feature_guide/batch_invariance.md:26
msgid ""
"Batch invariance requires a customed operator library for 910B. We will "
"release the customed operator library in future versions."
msgstr "批次不变性需要为 910B 定制的算子库。我们将在未来版本中发布该定制算子库。"
"Batch invariance requires a custom operator library for Atlas A2 "
"inference products. We will release the customed operator library in "
"future versions."
msgstr "批次不变性需要为 Atlas A2 推理产品定制的算子库。我们将在未来版本中发布该定制算子库。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:29
msgid "Enabling Batch Invariance"
@@ -150,7 +149,9 @@ msgid ""
"[GitHub issue tracker](https://github.com/vllm-project/vllm-"
"ascend/issues/new/choose)."
msgstr ""
"其他模型也可能适用,但上述模型已明确经过验证。如果您在使用特定模型时遇到问题,请在 [GitHub 问题跟踪器](https://github.com/vllm-project/vllm-ascend/issues/new/choose) 上报告。"
"其他模型也可能适用,但上述模型已明确经过验证。如果您在使用特定模型时遇到问题,请在 [GitHub "
"问题跟踪器](https://github.com/vllm-project/vllm-ascend/issues/new/choose) "
"上报告。"
#: ../../source/user_guide/feature_guide/batch_invariance.md:114
msgid "Implementation Details"
@@ -211,4 +212,6 @@ msgstr "额外的测试和验证"
msgid ""
"For the latest status and to contribute ideas, see the [tracking "
"issue](https://github.com/vllm-project/vllm-ascend/issues/5487)."
msgstr "有关最新状态和贡献想法,请参阅 [跟踪问题](https://github.com/vllm-project/vllm-ascend/issues/5487)。"
msgstr ""
"有关最新状态和贡献想法,请参阅 [跟踪问题](https://github.com/vllm-project/vllm-"
"ascend/issues/5487)。"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -34,7 +34,8 @@ msgid ""
"Parallel) and `DCP` (Decode Context Parallel), which reduces NPU memory "
"usage and improves inference speed in long sequence LLM inference."
msgstr ""
"本指南介绍如何使用上下文并行(Context Parallel),一种长序列推理优化技术。上下文并行包括 `PCP`(预填充上下文并行)和 `DCP`(解码上下文并行),可减少长序列LLM推理中的NPU内存使用并提升推理速度。"
"本指南介绍如何使用上下文并行(Context Parallel),一种长序列推理优化技术。上下文并行包括 `PCP`(预填充上下文并行)和 "
"`DCP`(解码上下文并行),可减少长序列LLM推理中的NPU内存使用并提升推理速度。"
#: ../../source/user_guide/feature_guide/context_parallel.md:7
msgid "Benefits of Context Parallel"
@@ -47,32 +48,28 @@ msgid ""
"and have quite different SLO (service level objectives), we need to "
"implement context parallel separately for them. The major considerations "
"are:"
msgstr ""
"上下文并行主要解决服务长上下文请求的问题。由于预填充和解码阶段具有截然不同的特性以及不同的服务级别目标(SLO),我们需要分别为它们实现上下文并行。主要考虑点如下:"
msgstr "上下文并行主要解决服务长上下文请求的问题。由于预填充和解码阶段具有截然不同的特性以及不同的服务级别目标(SLO),我们需要分别为它们实现上下文并行。主要考虑点如下:"
#: ../../source/user_guide/feature_guide/context_parallel.md:11
msgid ""
"For long context prefill, we can use context parallel to reduce TTFT "
"(time to first token) by amortizing the computation time of the prefill "
"across query tokens."
msgstr ""
"对于长上下文预填充,我们可以使用上下文并行,通过将预填充的计算时间分摊到查询令牌上,从而减少首令牌时间(TTFT)。"
msgstr "对于长上下文预填充,我们可以使用上下文并行,通过将预填充的计算时间分摊到查询令牌上,从而减少首令牌时间(TTFT)。"
#: ../../source/user_guide/feature_guide/context_parallel.md:12
msgid ""
"For long context decode, we can use context parallel to reduce KV cache "
"duplication and offer more space for KV cache to increase the batch size "
"(and hence the throughput)."
msgstr ""
"对于长上下文解码,我们可以使用上下文并行来减少KV缓存的重复存储,为KV缓存提供更多空间,从而增加批处理大小(进而提升吞吐量)。"
msgstr "对于长上下文解码,我们可以使用上下文并行来减少KV缓存的重复存储,为KV缓存提供更多空间,从而增加批处理大小(进而提升吞吐量)。"
#: ../../source/user_guide/feature_guide/context_parallel.md:14
msgid ""
"To learn more about the theory and implementation details of context "
"parallel, please refer to the [context parallel developer "
"guide](../../developer_guide/Design_Documents/context_parallel.md)."
msgstr ""
"要了解更多关于上下文并行的理论和实现细节,请参阅[上下文并行开发者指南](../../developer_guide/Design_Documents/context_parallel.md)。"
msgstr "要了解更多关于上下文并行的理论和实现细节,请参阅[上下文并行开发者指南](../../developer_guide/Design_Documents/context_parallel.md)。"
#: ../../source/user_guide/feature_guide/context_parallel.md:16
msgid "Supported Scenarios"
@@ -132,7 +129,9 @@ msgstr "如何使用上下文并行"
msgid ""
"You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and "
"`decode_context_parallel_size`, refer to the following example:"
msgstr "您可以通过 `prefill_context_parallel_size` 和 `decode_context_parallel_size` 启用 `PCP` 和 `DCP`,请参考以下示例:"
msgstr ""
"您可以通过 `prefill_context_parallel_size` 和 `decode_context_parallel_size` 启用"
" `PCP` 和 `DCP`,请参考以下示例:"
#: ../../source/user_guide/feature_guide/context_parallel.md:29
msgid "Offline example:"
@@ -147,7 +146,9 @@ msgid ""
"The total world size is `tensor_parallel_size` * "
"`prefill_context_parallel_size`, so the examples above need 4 NPUs for "
"each."
msgstr "总的世界大小为 `tensor_parallel_size` * `prefill_context_parallel_size`,因此上述示例各需要4个NPU。"
msgstr ""
"总的世界大小为 `tensor_parallel_size` * "
"`prefill_context_parallel_size`,因此上述示例各需要4个NPU。"
#: ../../source/user_guide/feature_guide/context_parallel.md:59
msgid "Constraints"
@@ -178,14 +179,18 @@ msgstr "对于基于GQA的模型,例如Qwen3-235B:"
msgid ""
"`(tensor_parallel_size // num_key_value_heads) >= "
"decode_context_parallel_size`"
msgstr "`(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size`"
msgstr ""
"`(tensor_parallel_size // num_key_value_heads) >= "
"decode_context_parallel_size`"
#: ../../source/user_guide/feature_guide/context_parallel.md:67
#, python-format
msgid ""
"`(tensor_parallel_size // num_key_value_heads) % "
"decode_context_parallel_size == 0`"
msgstr "`(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0`"
msgstr ""
"`(tensor_parallel_size // num_key_value_heads) % "
"decode_context_parallel_size == 0`"
#: ../../source/user_guide/feature_guide/context_parallel.md:69
msgid ""
@@ -195,7 +200,9 @@ msgid ""
"`block_size`(default: 128), which specifies CP to split KV cache in a "
"block-interleave style. For example:"
msgstr ""
"在需要KV缓存传输的场景(例如KV池化、PD解耦)中使用上下文并行时,为简化KV缓存传输,必须将 `cp_kv_cache_interleave_size` 设置为与KV缓存 `block_size`(默认:128)相同的值,这指定了CP以块交错方式分割KV缓存。例如:"
"在需要KV缓存传输的场景(例如KV池化、PD解耦)中使用上下文并行时,为简化KV缓存传输,必须将 "
"`cp_kv_cache_interleave_size` 设置为与KV缓存 "
"`block_size`(默认:128)相同的值,这指定了CP以块交错方式分割KV缓存。例如:"
#: ../../source/user_guide/feature_guide/context_parallel.md:80
msgid "Experimental Results"
@@ -206,9 +213,11 @@ msgid ""
"To evaluate the effectiveness of Context Parallel in long sequence LLM "
"inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, "
"deploy PD disaggregate instances in the environment of 64 cards Ascend "
"910C*64G (A3), the configuration and performance data are as follows."
"Atlas A3 inference products*64G (A3), the configuration and performance "
"data are as follows."
msgstr ""
"为评估上下文并行在长序列LLM推理场景中的有效性,我们使用 **DeepSeek-R1-W8A8** 和 **Qwen3-235B**,在64卡Ascend 910C*64G(A3)环境中部署PD解耦实例,配置和性能数据如下。"
"为评估上下文并行在长序列LLM推理场景中的有效性,我们使用 **DeepSeek-R1-W8A8** 和 "
"**Qwen3-235B**,在64卡Ascend Atlas A3推理产品*64G(A3)环境中部署PD解耦实例,配置和性能数据如下。"
#: ../../source/user_guide/feature_guide/context_parallel.md:84
msgid "DeepSeek-R1-W8A8:"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -36,7 +36,9 @@ msgid ""
"latency. This feature only adjusts host-side CPU affinity policies and "
"**does not alter model execution logic or impact inference results**."
msgstr ""
"CPU 绑定是 vLLM 的一项性能优化功能,专为配备 **ARM 架构和昇腾 NPU** 的服务器设计。它将 vLLM 进程和线程固定到特定的 CPU 核心,以减少 CPU-NPU 跨 NUMA 通信开销并稳定推理延迟。此功能仅调整主机端的 CPU 亲和性策略,**不会改变模型执行逻辑或影响推理结果**。"
"CPU 绑定是 vLLM 的一项性能优化功能,专为配备 **ARM 架构和昇腾 NPU** 的服务器设计。它将 vLLM 进程和线程固定到特定的 "
"CPU 核心,以减少 CPU-NPU 跨 NUMA 通信开销并稳定推理延迟。此功能仅调整主机端的 CPU "
"亲和性策略,**不会改变模型执行逻辑或影响推理结果**。"
#: ../../source/user_guide/feature_guide/cpu_binding.md:7
msgid "Usage"
@@ -84,12 +86,13 @@ msgstr "IRQ 绑定的额外注意事项"
#: ../../source/user_guide/feature_guide/cpu_binding.md:70
msgid ""
"For best results, if you run inside a docker container, which `systemctl`"
" is likely unavailable, stop `irqbalance` service on the host manually "
"before starting vLLM. Also make sure the container has the necessary "
"For best results, if you run inside a Docker container where `systemctl` "
"is likely unavailable, stop the `irqbalance` service on the host manually"
" before starting vLLM. Also make sure the container has the necessary "
"permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:"
msgstr ""
"为获得最佳效果,如果您在 Docker 容器内运行(容器内可能没有 `systemctl`),请在启动 vLLM 前手动在主机上停止 `irqbalance` 服务。同时确保容器具有写入 `/proc/irq/*/smp_affinity` 以进行 IRQ 绑定所需的权限:"
"为获得最佳效果,如果您在 Docker 容器内运行(容器内可能没有 `systemctl`),请在启动 vLLM 前手动在主机上停止 "
"`irqbalance` 服务。同时确保容器具有写入 `/proc/irq/*/smp_affinity` 以进行 IRQ 绑定所需的权限:"
#: ../../source/user_guide/feature_guide/cpu_binding.md:72
msgid "**Stop `irqbalance` service**:"
@@ -192,7 +195,9 @@ msgid ""
"1. Confirm that required tools (taskset, lscpu, npu-smi) are installed "
"and available; 2. Verify the Cpus_allowed_list in `/proc/self/status` is "
"valid."
msgstr "1. 确认所需工具(taskset, lscpu, npu-smi)已安装且可用;2. 验证 `/proc/self/status` 中的 Cpus_allowed_list 是有效的。"
msgstr ""
"1. 确认所需工具(taskset, lscpu, npu-smi)已安装且可用;2. 验证 `/proc/self/status` 中的 "
"Cpus_allowed_list 是有效的。"
#: ../../source/user_guide/feature_guide/cpu_binding.md:98
msgid "Key Limitations"
@@ -268,13 +273,17 @@ msgid ""
"processes and threads to specific CPU cores, thereby stabilizing "
"inference latency in Ascend NPU deployments (only applicable to ARM "
"architectures)."
msgstr "**核心目标**:通过将 vLLM 进程和线程固定到特定的 CPU 核心来减少跨 NUMA 通信,从而稳定昇腾 NPU 部署中的推理延迟(仅适用于 ARM 架构)。"
msgstr ""
"**核心目标**:通过将 vLLM 进程和线程固定到特定的 CPU 核心来减少跨 NUMA 通信,从而稳定昇腾 NPU 部署中的推理延迟(仅适用于"
" ARM 架构)。"
#: ../../source/user_guide/feature_guide/cpu_binding.md:130
msgid ""
"**Usage**: Enable or disable with `enable_cpu_binding` via "
"`additional_config` in both online and offline workflows."
msgstr "**使用方法**:在在线和离线工作流中,通过 `additional_config` 中的 `enable_cpu_binding` 参数启用或禁用。"
msgstr ""
"**使用方法**:在在线和离线工作流中,通过 `additional_config` 中的 `enable_cpu_binding` "
"参数启用或禁用。"
#: ../../source/user_guide/feature_guide/cpu_binding.md:132
msgid ""
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -29,8 +29,7 @@ msgid ""
"during each inference iteration within the chunked prefilling strategy "
"according to the resources and SLO targets, thereby improving the "
"effective throughput and decreasing the TBT."
msgstr ""
"动态批处理是一种技术,它根据资源和SLO目标,在分块预填充策略的每次推理迭代中动态调整块大小,从而提高有效吞吐量并降低TBT。"
msgstr "动态批处理是一种技术,它根据资源和SLO目标,在分块预填充策略的每次推理迭代中动态调整块大小,从而提高有效吞吐量并降低TBT。"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:5
msgid ""
@@ -41,7 +40,8 @@ msgid ""
"further improvements and this feature will support more XPUs in the "
"future."
msgstr ""
"动态批处理由 `--SLO_limits_for_dynamic_batch` 参数的值控制。值得注意的是,目前仅支持910 B3,且解码token数量规模需低于2048。特别是在Qwen、Llama模型上,改进效果相当明显。我们正在进行进一步的改进,该功能未来将支持更多XPU。"
"动态批处理由 `--SLO_limits_for_dynamic_batch` 参数的值控制。值得注意的是,目前仅支持910 "
"B3,且解码token数量规模需低于2048。特别是在Qwen、Llama模型上,改进效果相当明显。我们正在进行进一步的改进,该功能未来将支持更多XPU。"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:10
msgid "Getting started"
@@ -60,7 +60,10 @@ msgid ""
"ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to "
"the path `vllm_ascend/core/profile_table.csv`"
msgstr ""
"动态批处理目前依赖于一个保存在查找表中的离线成本模型来优化token预算。该查找表保存在一个'.csv'文件中,需要先从[A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv)下载,重命名后保存到路径 `vllm_ascend/core/profile_table.csv`。"
"动态批处理目前依赖于一个保存在查找表中的离线成本模型来优化token预算。该查找表保存在一个'.csv'文件中,需要先从[A2-B3-BLK128.csv](https"
"://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-"
"ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv)下载,重命名后保存到路径 "
"`vllm_ascend/core/profile_table.csv`。"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:16
msgid ""
@@ -75,12 +78,12 @@ msgstr "调优参数"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:24
msgid ""
"`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) "
"for the dynamic batch feature, larger values impose more constraints on "
"the latency limitation, leading to higher effective throughput. The "
"parameter can be selected according to the specific models or service "
"requirements."
"for the dynamic batch feature, larger values relax latency limitation, "
"leading to higher effective throughput. The parameter can be selected "
"according to the specific models or service requirements."
msgstr ""
"`--SLO_limits_for_dynamic_batch` 是动态批处理功能的调优参数(整数类型),较大的值会对延迟限制施加更多约束,从而带来更高的有效吞吐量。可以根据具体模型或服务需求选择该参数。"
"`--SLO_limits_for_dynamic_batch` "
"是动态批处理功能的调优参数(整数类型),较大的值会放宽延迟限制,从而带来更高的有效吞吐量。可以根据具体模型或服务需求选择该参数。"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:32
msgid "Supported Models"
@@ -95,7 +98,9 @@ msgid ""
"75`. Therefore, some additional tests are needed to select the best "
"parameter."
msgstr ""
"目前,动态批处理在几个密集模型上表现更好,包括Qwen和Llama(从8B到32B),且 `tensor_parallel_size=8`。对于不同的模型,需要一个合适的 `SLO_limits_for_dynamic_batch` 参数。该参数的经验值通常是 `35、50或75`。因此,需要进行一些额外的测试来选择最佳参数。"
"目前,动态批处理在几个密集模型上表现更好,包括Qwen和Llama(从8B到32B),且 "
"`tensor_parallel_size=8`。对于不同的模型,需要一个合适的 `SLO_limits_for_dynamic_batch` "
"参数。该参数的经验值通常是 `35、50或75`。因此,需要进行一些额外的测试来选择最佳参数。"
#: ../../source/user_guide/feature_guide/dynamic_batch.md:36
msgid "Usage"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -29,14 +29,14 @@ msgstr "概述"
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:5
msgid ""
"Expert balancing for MoE models in LLM serving is essential for optimal "
"performance. Dynamically changing experts during inference can negatively"
" impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due "
"to stop-the-world operations. SwiftBalancer enables asynchronous expert "
"load balancing with zero-overhead expert movement, ensuring seamless "
"service continuity."
"Expert balancing for MoE (Mixture of Experts) models in LLM (Large "
"Language) serving is essential for optimal performance. Dynamically "
"changing experts during inference can negatively impact TTFT (Time To "
"First Token) and TPOT (Time Per Output Token) due to stop-the-world "
"operations. SwiftBalancer enables asynchronous expert load balancing with"
" zero-overhead expert movement, ensuring seamless service continuity."
msgstr ""
"在LLM服务中,MoE模型的专家均衡对于实现最佳性能至关重要。推理过程中动态改变专家会因全局暂停操作而对TTFT(首词元时间)和TPOT(每输出词元时间)产生负面影响。SwiftBalancer支持异步专家负载均衡,实现零开销的专家迁移,确保服务无缝连续。"
"在LLM(大语言模型)服务中,MoE(混合专家)模型的专家均衡对于实现最佳性能至关重要。推理过程中动态改变专家会因全局暂停操作而对TTFT(首词元时间)和TPOT(每输出词元时间)产生负面影响。SwiftBalancer支持异步专家负载均衡,实现零开销的专家迁移,确保服务无缝连续。"
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:7
msgid "EPLB Effects"
@@ -107,7 +107,9 @@ msgid ""
"Adjust expert_heat_collection_interval and algorithm_execution_interval "
"based on workload patterns."
msgstr ""
"我们需要添加环境变量 `export DYNAMIC_EPLB=\"true\"` 来启用vLLM EPLB。启用具有自动调优参数的动态均衡。根据工作负载模式调整 expert_heat_collection_interval 和 algorithm_execution_interval。"
"我们需要添加环境变量 `export DYNAMIC_EPLB=\"true\"` 来启用vLLM "
"EPLB。启用具有自动调优参数的动态均衡。根据工作负载模式调整 expert_heat_collection_interval 和 "
"algorithm_execution_interval。"
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:42
msgid "Static EPLB"
@@ -124,7 +126,8 @@ msgid ""
"expert_map_record_path. This creates a baseline configuration for future "
"deployments."
msgstr ""
"我们需要添加环境变量 `export EXPERT_MAP_RECORD=\"true\"` 来记录专家映射。使用 expert_map_record_path 生成初始专家分布映射。这将为未来的部署创建一个基线配置。"
"我们需要添加环境变量 `export EXPERT_MAP_RECORD=\"true\"` 来记录专家映射。使用 "
"expert_map_record_path 生成初始专家分布映射。这将为未来的部署创建一个基线配置。"
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:60
msgid "Subsequent Deployments (Use Recorded Map)"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -141,10 +141,6 @@ msgstr "为保证哈希生成的一致性,启用 KV Pool 时,需要在所有
msgid "Example of using Mooncake as a KV Pool backend"
msgstr "使用 Mooncake 作为 KV Pool 后端的示例"
#: ../../source/user_guide/feature_guide/kv_pool.md:35
msgid "Software:"
msgstr "软件:"
#: ../../source/user_guide/feature_guide/kv_pool.md:36
msgid "Check NPU HCCN Configuration:"
msgstr "检查 NPU HCCN 配置:"
@@ -167,9 +163,9 @@ msgid ""
"binaries>. First, we need to obtain the Mooncake project. Refer to the "
"following command:"
msgstr ""
"Mooncake 是 Moonshot AI 提供的领先 LLM 服务 Kimi 的推理平台。 安装与编译指南:"
"<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-"
"binaries>。 首先,我们需要获取 Mooncake 项目。参考以下命令:"
"Mooncake 是 Moonshot AI 提供的领先 LLM 服务 Kimi 的推理平台。 "
"安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-"
"and-use-binaries>。 首先,我们需要获取 Mooncake 项目。参考以下命令:"
#: ../../source/user_guide/feature_guide/kv_pool.md:54
msgid "(Optional) Replace go install url if the network is poor"
@@ -266,7 +262,7 @@ msgid "`export HCCL_INTRA_ROCE_ENABLE=1`"
msgstr "`export HCCL_INTRA_ROCE_ENABLE=1`"
#: ../../source/user_guide/feature_guide/kv_pool.md
msgid "Required by direct transmission cheme on 800 I/T A2 series"
msgid "Required by direct transmission scheme on 800 I/T A2 series"
msgstr "800 I/T A2 系列直传方案所需"
#: ../../source/user_guide/feature_guide/kv_pool.md:102
@@ -280,8 +276,8 @@ msgid ""
"(ascend_direct), see: "
"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
msgstr ""
|
||||
"关于 HIXL (ascend_direct) 的常见故障排除和问题定位指南,请参阅:"
|
||||
"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
|
||||
"关于 HIXL (ascend_direct) "
|
||||
"的常见故障排除和问题定位指南,请参阅:<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:107
|
||||
msgid "Run Mooncake Master"
|
||||
@@ -305,9 +301,9 @@ msgid ""
|
||||
"service. **global_segment_size**: Registered memory size per card to "
|
||||
"the KV Pool. **Needs to be aligned to 1GB.**"
|
||||
msgstr ""
|
||||
"**metadata_server**: 配置为 **P2PHANDSHAKE**。 **protocol:** 在 NPU 上必须设置为 'Ascend'。"
|
||||
"**device_name**: \"\" **master_server_address**: 配置 master 服务的 IP 和端口。 "
|
||||
"**global_segment_size**: 每张卡注册到 KV Pool 的内存大小。**需要对齐到 1GB。**"
|
||||
"**metadata_server**: 配置为 **P2PHANDSHAKE**。 **protocol:** 在 NPU 上必须设置为 "
|
||||
"'Ascend'。**device_name**: \"\" **master_server_address**: 配置 master 服务的"
|
||||
" IP 和端口。 **global_segment_size**: 每张卡注册到 KV Pool 的内存大小。**需要对齐到 1GB。**"
|
||||
|
||||
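The entry above states that `global_segment_size` needs to be aligned to 1GB. A minimal sketch of rounding a requested size up to that boundary follows. It assumes "1GB" means 2**30 bytes and that rounding up (rather than down) is wanted. `align_up_to_gib` is a hypothetical helper, not part of Mooncake or vllm-ascend.

```python
GIB = 1 << 30  # assumption: "1GB" in the doc means 2**30 bytes


def align_up_to_gib(size_bytes: int) -> int:
    """Round a positive byte count up to the next multiple of 1 GiB."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return ((size_bytes + GIB - 1) // GIB) * GIB


if __name__ == "__main__":
    requested = 3 * GIB + 123  # slightly more than 3 GiB
    print(align_up_to_gib(requested) // GIB, "GiB")
```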
#: ../../source/user_guide/feature_guide/kv_pool.md:129
|
||||
msgid "2.Start mooncake_master"
|
||||
@@ -326,8 +322,9 @@ msgid ""
|
||||
"`--default_kv_lease_ttl` and keep it larger than `ASCEND_CONNECT_TIMEOUT`"
|
||||
" and `ASCEND_TRANSFER_TIMEOUT`."
|
||||
msgstr ""
|
||||
"`eviction_high_watermark_ratio` 决定了 Mooncake Store 执行淘汰的水位线,`eviction_ratio` 决定了将被淘汰的存储对象比例。"
|
||||
"`default_kv_lease_ttl` 控制 KV 对象的默认租约 TTL(毫秒);通过 `--default_kv_lease_ttl` 配置,并保持其大于 "
|
||||
"`eviction_high_watermark_ratio` 决定了 Mooncake Store "
|
||||
"执行淘汰的水位线,`eviction_ratio` 决定了将被淘汰的存储对象比例。`default_kv_lease_ttl` 控制 KV "
|
||||
"对象的默认租约 TTL(毫秒);通过 `--default_kv_lease_ttl` 配置,并保持其大于 "
|
||||
"`ASCEND_CONNECT_TIMEOUT` 和 `ASCEND_TRANSFER_TIMEOUT`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:140
|
||||
@@ -347,8 +344,9 @@ msgid ""
|
||||
"performs kv_transfer, while `AscendStoreConnector` serves as the prefix-"
|
||||
"cache node."
|
||||
msgstr ""
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。"
|
||||
"`MooncakeConnectorV1` 执行 kv_transfer,而 `AscendStoreConnector` 作为 prefix-cache 节点。"
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 "
|
||||
"`AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 "
|
||||
"`AscendStoreConnector` 作为 prefix-cache 节点。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:146
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:611
|
||||
@@ -379,9 +377,10 @@ msgid ""
|
||||
"AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` "
|
||||
"or `prefill_pp_layer_partition` also needs to be set. Example as follows:"
|
||||
msgstr ""
|
||||
"目前,PD 解耦中的键值池默认仅存储 Prefill 节点生成的 kv cache。在使用 MLA 的模型中,现已支持 Decode 节点存储 kv cache 供 "
|
||||
"Prefill 节点使用,通过在 AscendStoreConnector 中添加 `consumer_is_to_put: true` 来启用。如果 Prefill "
|
||||
"节点启用了 PP,则还需要设置 `prefill_pp_size` 或 `prefill_pp_layer_partition`。示例如下:"
|
||||
"目前,PD 解耦中的键值池默认仅存储 Prefill 节点生成的 kv cache。在使用 MLA 的模型中,现已支持 Decode 节点存储 "
|
||||
"kv cache 供 Prefill 节点使用,通过在 AscendStoreConnector 中添加 `consumer_is_to_put:"
|
||||
" true` 来启用。如果 Prefill 节点启用了 PP,则还需要设置 `prefill_pp_size` 或 "
|
||||
"`prefill_pp_layer_partition`。示例如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:308
|
||||
msgid "2、Start proxy_server"
|
||||
@@ -452,7 +451,10 @@ msgid ""
|
||||
"required. Establishing these connections introduces a one-time time "
|
||||
"overhead and persistent device memory consumption (4 MB of device memory "
|
||||
"per connection)."
|
||||
msgstr "这是因为当涉及设备到设备通信时,HCCL 单边通信连接是在实例启动后延迟创建的。目前,需要在所有设备之间建立全连接。建立这些连接会引入一次性时间开销和持续的设备内存消耗(每个连接消耗 4 MB 设备内存)。"
|
||||
msgstr ""
|
||||
"这是因为当涉及设备到设备通信时,HCCL "
|
||||
"单边通信连接是在实例启动后延迟创建的。目前,需要在所有设备之间建立全连接。建立这些连接会引入一次性时间开销和持续的设备内存消耗(每个连接消耗 4 "
|
||||
"MB 设备内存)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:404
|
||||
msgid ""
|
||||
@@ -473,19 +475,25 @@ msgstr "安装 Memcache"
|
||||
msgid ""
|
||||
"**MemCache depends on MemFabric. Therefore, MemFabric must be "
|
||||
"installed.Installing the memcache after the memfabric is installed.**"
|
||||
msgstr "**MemCache 依赖于 MemFabric。因此,必须先安装 MemFabric。在 memfabric 安装完成后,再安装 memcache。**"
|
||||
msgstr ""
|
||||
"**MemCache 依赖于 MemFabric。因此,必须先安装 MemFabric。在 memfabric 安装完成后,再安装 "
|
||||
"memcache。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:412
|
||||
msgid ""
|
||||
"**memfabric_hybrid**: "
|
||||
"<https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
msgstr "**memfabric_hybrid**: <https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
msgstr ""
|
||||
"**memfabric_hybrid**: "
|
||||
"<https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:414
|
||||
msgid ""
|
||||
"**memcache**: "
|
||||
"<https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
msgstr "**memcache**: <https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
msgstr ""
|
||||
"**memcache**: "
|
||||
"<https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:416
|
||||
msgid "Configuring the memcache Config File"
|
||||
@@ -509,7 +517,9 @@ msgid ""
|
||||
"You are advised to copy mmc-local.conf and mmc-meta.conf to your own path"
|
||||
" and modify them, and set the MMC_META_CONFIG_PATH environment variable "
|
||||
"to the path of your own mmc-meta.conf file."
|
||||
msgstr "建议您将 mmc-local.conf 和 mmc-meta.conf 复制到您自己的路径并进行修改,并将 MMC_META_CONFIG_PATH 环境变量设置为您自己的 mmc-meta.conf 文件的路径。"
|
||||
msgstr ""
|
||||
"建议您将 mmc-local.conf 和 mmc-meta.conf 复制到您自己的路径并进行修改,并将 "
|
||||
"MMC_META_CONFIG_PATH 环境变量设置为您自己的 mmc-meta.conf 文件的路径。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:436
|
||||
msgid "**mmc-meta.conf:**"
|
||||
@@ -574,7 +584,9 @@ msgid ""
|
||||
" ROCE available, recommended for A2), `device_sdma` (supported for A3 "
|
||||
"when HCCS available, recommended for A3). Currently does not support "
|
||||
"heterogeneous protocol setting."
|
||||
msgstr "`host_rdma` (默认), `device_rdma` (A2 和 A3 在设备 ROCE 可用时支持,推荐用于 A2), `device_sdma` (A3 在 HCCS 可用时支持,推荐用于 A3)。目前不支持异构协议设置。"
|
||||
msgstr ""
|
||||
"`host_rdma` (默认), `device_rdma` (A2 和 A3 在设备 ROCE 可用时支持,推荐用于 A2), "
|
||||
"`device_sdma` (A3 在 HCCS 可用时支持,推荐用于 A3)。目前不支持异构协议设置。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md
|
||||
msgid "`ock.mmc.local_service.dram.size`"
|
||||
@@ -607,7 +619,10 @@ msgid ""
|
||||
"Using `MultiConnector` to simultaneously utilize both "
|
||||
"`MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` "
|
||||
"performs kv_transfer, while `AscendStoreConnector` enables KV Cache Pool"
|
||||
msgstr "使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 `AscendStoreConnector` 启用 KV 缓存池"
|
||||
msgstr ""
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 "
|
||||
"`AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 "
|
||||
"`AscendStoreConnector` 启用 KV 缓存池"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:609
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:918
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,35 +29,33 @@ msgstr "概述"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:5
|
||||
msgid ""
|
||||
"**Layer Shard Linear** is a memory-optimization feature designed for "
|
||||
"**Layer Sharding Linear** is a memory-optimization feature designed for "
|
||||
"large language model (LLM) inference. It addresses the high memory "
|
||||
"pressure caused by **repeated linear operators across many layers** that "
|
||||
"share identical structure but have distinct weights."
|
||||
msgstr ""
|
||||
"**层分片线性算子** 是一项为大语言模型推理设计的内存优化功能。它旨在解决由**跨越多层的重复线性算子**所引起的高内存压力,这些算子结构相同但权重不同。"
|
||||
"**层分片线性算子** "
|
||||
"是一项为大语言模型推理设计的内存优化功能。它旨在解决由**跨越多层的重复线性算子**所引起的高内存压力,这些算子结构相同但权重不同。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:7
|
||||
msgid ""
|
||||
"Instead of replicating all weights on every device, **Layer Shard Linear "
|
||||
"shards the weights of a \"series\" of such operators across the NPU "
|
||||
"devices in a communication group**:"
|
||||
msgstr ""
|
||||
"与在每个设备上复制所有权重不同,**层分片线性算子将此类算子的一个\"系列\"的权重分片到通信组内的NPU设备上**:"
|
||||
"Instead of replicating all weights on every device, **Layer Sharding "
|
||||
"Linear shards the weights of a \"series\" of such operators across the "
|
||||
"NPU devices in a communication group**:"
|
||||
msgstr "与在每个设备上复制所有权重不同,**层分片线性算子将此类算子的一个\"系列\"的权重分片到通信组内的NPU设备上**:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:9
|
||||
msgid ""
|
||||
"The **i-th layer's linear weight** is stored **only on device `i % K`**, "
|
||||
"The **i-th layer's linear weight** is stored **only on device `i % K`** "
|
||||
"where `K` is the number of devices in the group."
|
||||
msgstr ""
|
||||
"**第 i 层的线性权重** **仅存储在设备 `i % K` 上**,其中 `K` 是组内的设备数量。"
|
||||
msgstr "**第 i 层的线性权重** **仅存储在设备 `i % K` 上**,其中 `K` 是组内的设备数量。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:10
|
||||
msgid ""
|
||||
"Other devices hold a lightweight **shared dummy tensor** during "
|
||||
"initialization and fetch the real weight **on-demand** via asynchronous "
|
||||
"broadcast during the forward pass."
|
||||
msgstr ""
|
||||
"其他设备在初始化期间持有一个轻量级的**共享虚拟张量**,并在前向传播期间通过异步广播**按需**获取真实权重。"
|
||||
msgstr "其他设备在初始化期间持有一个轻量级的**共享虚拟张量**,并在前向传播期间通过异步广播**按需**获取真实权重。"
|
||||
|
||||
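The placement rule in the two entries above (layer `i` is stored only on device `i % K`) can be sketched as follows. This is only an illustration of the index arithmetic. `owner_device` and `layers_on_device` are hypothetical names, not vllm-ascend APIs, and the dummy tensor and asynchronous broadcast are not modeled.

```python
from typing import List


def owner_device(layer_idx: int, num_devices: int) -> int:
    """Device that holds the real weight of layer `layer_idx` (i % K)."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return layer_idx % num_devices


def layers_on_device(device: int, num_layers: int, num_devices: int) -> List[int]:
    """All layer indices whose real weight lives on `device`."""
    return [i for i in range(num_layers) if owner_device(i, num_devices) == device]


if __name__ == "__main__":
    # 61 layers (the depth the doc cites for DeepSeek-V3/R1) over a group of 4 devices.
    for dev in range(4):
        print(dev, layers_on_device(dev, 61, 4))
```

Under this rule each device stores roughly `1/K` of the sharded weights, which is where the memory saving comes from.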
#: ../../source/user_guide/feature_guide/layer_sharding.md:12
|
||||
msgid ""
|
||||
@@ -75,8 +73,7 @@ msgstr ""
|
||||
msgid ""
|
||||
"This approach **preserves exact computational semantics** while "
|
||||
"**significantly reducing NPU memory footprint**, especially critical for:"
|
||||
msgstr ""
|
||||
"这种方法**保持了精确的计算语义**,同时**显著减少了NPU内存占用**,这对于以下情况尤其关键:"
|
||||
msgstr "这种方法**保持了精确的计算语义**,同时**显著减少了NPU内存占用**,这对于以下情况尤其关键:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:16
|
||||
msgid "Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);"
|
||||
@@ -89,7 +86,9 @@ msgid ""
|
||||
"/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix "
|
||||
"must reside in memory per layer;"
|
||||
msgstr ""
|
||||
"使用 **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** 或 **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)** 的模型,其中完整的`O`(输出)投影矩阵必须驻留在每层的内存中;"
|
||||
"使用 **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** 或 "
|
||||
"**[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)** "
|
||||
"的模型,其中完整的`O`(输出)投影矩阵必须驻留在每层的内存中;"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:18
|
||||
msgid ""
|
||||
@@ -111,12 +110,13 @@ msgstr "层分片"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:26
|
||||
msgid ""
|
||||
"**Figure.** Layer Shard Linear workflow: weights are sharded by layer "
|
||||
"**Figure.** Layer Sharding Linear workflow: weights are sharded by layer "
|
||||
"across devices (top), and during forward execution (bottom), asynchronous"
|
||||
" broadcast **pre-fetches** the next layer's weight while the current "
|
||||
"layer computes—enabling **zero-overhead** weight loading."
|
||||
"layer computes-enabling **zero-overhead** weight loading."
|
||||
msgstr ""
|
||||
"**图.** 层分片线性算子工作流程:权重按层分片到各设备(顶部),在前向执行期间(底部),异步广播**预取**下一层的权重,同时当前层进行计算——实现**零开销**的权重加载。"
|
||||
"**图.** "
|
||||
"层分片线性算子工作流程:权重按层分片到各设备(顶部),在前向执行期间(底部),异步广播**预取**下一层的权重,同时当前层进行计算——实现**零开销**的权重加载。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:30
|
||||
msgid "Getting Started"
|
||||
@@ -124,11 +124,12 @@ msgstr "快速开始"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:32
|
||||
msgid ""
|
||||
"To enable **Layer Shard Linear**, specify the target linear layers using "
|
||||
"the `--additional-config` argument when launching your inference job. For"
|
||||
" example, to shard the `o_proj` and `q_b_proj` layers, use:"
|
||||
"To enable **Layer Sharding Linear**, specify the target linear layers "
|
||||
"using the `--additional-config` argument when launching your inference "
|
||||
"job. For example, to shard the `o_proj` and `q_b_proj` layers, use:"
|
||||
msgstr ""
|
||||
"要启用**层分片线性算子**,请在启动推理作业时使用 `--additional-config` 参数指定目标线性层。例如,要对 `o_proj` 和 `q_b_proj` 层进行分片,请使用:"
|
||||
"要启用**层分片线性算子**,请在启动推理作业时使用 `--additional-config` 参数指定目标线性层。例如,要对 `o_proj`"
|
||||
" 和 `q_b_proj` 层进行分片,请使用:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:40
|
||||
msgid ""
|
||||
@@ -136,7 +137,8 @@ msgid ""
|
||||
"be enabled on the **P node** with `kv_role=\"kv_producer\"`. "
|
||||
"`kv_role=\"kv_consumer\"` and `kv_role=\"kv_both\"` are not supported."
|
||||
msgstr ""
|
||||
"**限制** 在PD解耦部署中,层分片只能在 `kv_role=\"kv_producer\"` 的 **P节点** 上启用。不支持 `kv_role=\"kv_consumer\"` 和 `kv_role=\"kv_both\"`。"
|
||||
"**限制** 在PD解耦部署中,层分片只能在 `kv_role=\"kv_producer\"` 的 **P节点** 上启用。不支持 "
|
||||
"`kv_role=\"kv_consumer\"` 和 `kv_role=\"kv_both\"`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:46
|
||||
msgid "Supported Scenarios"
|
||||
@@ -157,7 +159,8 @@ msgid ""
|
||||
"resident in memory for each layer. Layer sharding significantly reduces "
|
||||
"memory pressure by distributing these weights across devices."
|
||||
msgstr ""
|
||||
"当使用 [FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188) 时,完整的输出投影(`o_proj`)矩阵必须驻留在每层的内存中。层分片通过将这些权重分布到各设备上,显著降低了内存压力。"
|
||||
"当使用 [FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188) "
|
||||
"时,完整的输出投影(`o_proj`)矩阵必须驻留在每层的内存中。层分片通过将这些权重分布到各设备上,显著降低了内存压力。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:54
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:71
|
||||
@@ -175,11 +178,12 @@ msgid ""
|
||||
"stored per layer. Sharding these layers across NPUs helps fit extremely "
|
||||
"deep models (e.g., 61-layer architectures) into limited device memory."
|
||||
msgstr ""
|
||||
"使用 [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702) 时,`q_b_proj` 和 `o_proj` 层都需要每层存储大型权重矩阵。将这些层分片到多个NPU上有助于将极深的模型(例如,61层架构)装入有限的设备内存中。"
|
||||
"使用 [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702) "
|
||||
"时,`q_b_proj` 和 `o_proj` "
|
||||
"层都需要每层存储大型权重矩阵。将这些层分片到多个NPU上有助于将极深的模型(例如,61层架构)装入有限的设备内存中。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:69
|
||||
msgid ""
|
||||
"In PD-disaggregated deployments, this mode is supported only on the **P "
|
||||
"node** with `kv_role=\"kv_producer\"`."
|
||||
msgstr ""
|
||||
"在PD解耦部署中,此模式仅在 `kv_role=\"kv_producer\"` 的 **P节点** 上受支持。"
|
||||
msgstr "在PD解耦部署中,此模式仅在 `kv_role=\"kv_producer\"` 的 **P节点** 上受支持。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -21,13 +21,12 @@ msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:1
|
||||
msgid "Netloader Guide"
|
||||
msgstr "网络加载器指南"
|
||||
msgstr "Netloader 指南"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:3
|
||||
msgid ""
|
||||
"This guide provides instructions for using **Netloader** as a weight-"
|
||||
"loader plugin for acceleration in **vLLM Ascend**."
|
||||
msgstr "本指南介绍如何将 **Netloader** 用作权重加载器插件,以在 **vLLM Ascend** 中实现加速。"
|
||||
"This guide provides instructions for using **Netloader** as a weight-loader plugin for acceleration in **vLLM Ascend**."
|
||||
msgstr "本指南介绍如何使用 **Netloader** 作为权重加载器插件,以在 **vLLM Ascend** 中实现加速。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:7
|
||||
msgid "Overview"
|
||||
@@ -35,9 +34,7 @@ msgstr "概述"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:9
|
||||
msgid ""
|
||||
"Netloader leverages high-bandwidth peer-to-peer (P2P) transfers between "
|
||||
"NPU cards to load model weights. It is implemented as a plugin (via the "
|
||||
"`register_model_loader` API added in vLLM 0.10). The workflow is:"
|
||||
"Netloader leverages high-bandwidth peer-to-peer (P2P) transfers between NPU cards to load model weights. It is implemented as a plugin (via the `register_model_loader` API added in vLLM 0.10). The workflow is:"
|
||||
msgstr "Netloader 利用 NPU 卡之间的高带宽点对点 (P2P) 传输来加载模型权重。它通过插件实现(使用 vLLM 0.10 中添加的 `register_model_loader` API)。工作流程如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:11
|
||||
@@ -50,16 +47,12 @@ msgstr "新的 **客户端** 实例请求权重传输。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:13
|
||||
msgid ""
|
||||
"After validating that the model and partitioning match, the client uses "
|
||||
"HCCL collective communication (send/recv) to receive weights in the same "
|
||||
"order as stored in the model."
|
||||
"After validating that the model and partitioning match, the client uses HCCL collective communication (send/recv) to receive weights in the same order as stored in the model."
|
||||
msgstr "在验证模型和分区匹配后,客户端使用 HCCL 集合通信 (send/recv) 按照模型中存储的相同顺序接收权重。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:15
|
||||
msgid ""
|
||||
"The server runs alongside normal inference tasks via sub-threads and via "
|
||||
"`stateless_init_torch_distributed_process_group` in vLLM. The client thus"
|
||||
" takes over weight initialization without needing to load from storage."
|
||||
"The server runs alongside normal inference tasks via sub-threads and via `stateless_init_torch_distributed_process_group` in vLLM. The client thus takes over weight initialization without needing to load from storage."
|
||||
msgstr "服务器通过子线程以及 vLLM 中的 `stateless_init_torch_distributed_process_group` 与常规推理任务并行运行。因此,客户端接管权重初始化,无需从存储加载。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:17
|
||||
@@ -68,11 +61,11 @@ msgstr "流程图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:19
|
||||
msgid ""
|
||||
msgstr ""
|
||||
msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:19
|
||||
msgid "netloader flowchart"
|
||||
msgstr "网络加载器流程图"
|
||||
msgstr "netloader 流程图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:21
|
||||
msgid "Timing Diagram"
|
||||
@@ -80,11 +73,11 @@ msgstr "时序图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:23
|
||||
msgid ""
|
||||
msgstr ""
|
||||
msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:23
|
||||
msgid "netloader timing diagram"
|
||||
msgstr "网络加载器时序图"
|
||||
msgstr "netloader 时序图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:25
|
||||
msgid "Application Scenarios"
|
||||
@@ -92,30 +85,22 @@ msgstr "应用场景"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:27
|
||||
msgid ""
|
||||
"**Reduce startup latency**: By reusing already loaded weights and "
|
||||
"transferring them directly between NPU cards, Netloader cuts down model "
|
||||
"loading time versus conventional remote/local pull strategies."
|
||||
"**Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time versus conventional remote/local pull strategies."
|
||||
msgstr "**减少启动延迟**:通过重用已加载的权重并在 NPU 卡之间直接传输,Netloader 相比传统的远程/本地拉取策略,缩短了模型加载时间。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:28
|
||||
msgid ""
|
||||
"**Relieve network & storage load**: Avoid repeated downloads of weight "
|
||||
"files from remote repositories, thus reducing pressure on central storage"
|
||||
" and network traffic."
|
||||
"**Relieve network & storage load**: Avoid repeated downloads of weight files from remote repositories, thus reducing pressure on central storage and network traffic."
|
||||
msgstr "**减轻网络和存储负载**:避免从远程仓库重复下载权重文件,从而减轻中心存储和网络流量的压力。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:29
|
||||
msgid ""
|
||||
"**Improve resource utilization & lower cost**: Faster loading allows less"
|
||||
" reliance on standby compute nodes; resources can be scaled up/down more "
|
||||
"flexibly."
|
||||
"**Improve resource utilization & lower cost**: Faster loading allows less reliance on standby compute nodes; resources can be scaled up/down more flexibly."
|
||||
msgstr "**提高资源利用率并降低成本**:更快的加载速度减少了对备用计算节点的依赖;资源可以更灵活地伸缩。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:30
|
||||
msgid ""
|
||||
"**Enhance business continuity & high availability**: In failure recovery,"
|
||||
" new instances can quickly take over without long downtime, improving "
|
||||
"system reliability and user experience."
|
||||
"**Enhance business continuity & high availability**: In failure recovery, new instances can quickly take over without long downtime, improving system reliability and user experience."
|
||||
msgstr "**增强业务连续性和高可用性**:在故障恢复时,新实例可以快速接管而无需长时间停机,从而提高系统可靠性和用户体验。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:34
|
||||
@@ -124,9 +109,7 @@ msgstr "使用方法"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:36
|
||||
msgid ""
|
||||
"To enable Netloader, pass `--load-format=netloader` and provide "
|
||||
"configuration via `--model-loader-extra-config` (as a JSON string). Below"
|
||||
" are the supported configuration fields:"
|
||||
"To enable Netloader, pass `--load-format=netloader` and provide configuration via `--model-loader-extra-config` (as a JSON string). Below are the supported configuration fields:"
|
||||
msgstr "要启用 Netloader,请传递 `--load-format=netloader` 并通过 `--model-loader-extra-config`(作为 JSON 字符串)提供配置。以下是支持的配置字段:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -156,12 +139,7 @@ msgstr "列表"
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"Weight data sources. Each item is a map with `device_id` and `sources`, "
|
||||
"specifying the rank and its endpoints (IP:port). <br>Example: "
|
||||
"`{\"SOURCE\": [{\"device_id\": 0, \"sources\": "
|
||||
"[\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": "
|
||||
"[\"10.170.22.152:11228\"]}]}` <br>If omitted or empty, fallback to "
|
||||
"default loader. The SOURCE here is second priority."
|
||||
"Weight data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). <br>Example: `{\"SOURCE\": [{\"device_id\": 0, \"sources\": [\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": [\"10.170.22.152:11228\"]}]}` <br>If omitted or empty, fallback to default loader. The SOURCE here is second priority."
|
||||
msgstr "权重数据源。每个条目是一个包含 `device_id` 和 `sources` 的映射,指定了 rank 及其端点 (IP:端口)。<br>示例:`{\"SOURCE\": [{\"device_id\": 0, \"sources\": [\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": [\"10.170.22.152:11228\"]}]}` <br>如果省略或为空,则回退到默认加载器。此处的 SOURCE 是第二优先级。"
|
||||
|
||||
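The `SOURCE` field described above is passed as a JSON string via `--model-loader-extra-config`. Below is a small sketch that builds that string from a per-device endpoint map. `build_extra_config` is a hypothetical helper. Only the `SOURCE`, `device_id`, and `sources` keys come from the entry above.

```python
import json
from typing import Dict, List


def build_extra_config(endpoints: Dict[int, List[str]]) -> str:
    """Serialize {device_id: ["IP:port", ...]} into the SOURCE layout shown above."""
    source = [
        {"device_id": device_id, "sources": list(addrs)}
        for device_id, addrs in sorted(endpoints.items())
    ]
    return json.dumps({"SOURCE": source})


if __name__ == "__main__":
    # Endpoints taken from the example in the translated entry.
    cfg = build_extra_config({
        0: ["10.170.22.152:19374"],
        1: ["10.170.22.152:11228"],
    })
    print(cfg)
```

The resulting string would then be supplied together with `--load-format=netloader`, as the usage entry earlier in this file describes.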
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -198,9 +176,7 @@ msgstr "服务器监听器的基础端口。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port "
|
||||
"is chosen. Valid range: 1024–65535. If out of range, that server instance"
|
||||
" won’t open a listener."
|
||||
"The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port is chosen. Valid range: 1024–65535. If out of range, that server instance won’t open a listener."
|
||||
msgstr "实际端口 = `LISTEN_PORT + RANK`。如果省略,则选择一个随机有效端口。有效范围:1024–65535。如果超出范围,该服务器实例将不会打开监听器。"
|
||||
|
||||
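The port rule above (`LISTEN_PORT + RANK`, valid range 1024 to 65535) can be checked ahead of launch with a sketch like the following. It assumes the range check applies to the summed port, which the entry does not state explicitly. `listener_port` is a hypothetical helper, and the random-port fallback used when `LISTEN_PORT` is omitted is not modeled.

```python
from typing import Optional

MIN_PORT, MAX_PORT = 1024, 65535  # valid range quoted in the entry above


def listener_port(listen_port: int, rank: int) -> Optional[int]:
    """Return LISTEN_PORT + RANK, or None when the result is out of range."""
    port = listen_port + rank
    return port if MIN_PORT <= port <= MAX_PORT else None


if __name__ == "__main__":
    for rank in range(4):
        print(rank, listener_port(65533, rank))
```

A `None` result here corresponds to the case where, per the entry, that server instance does not open a listener.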
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -213,10 +189,7 @@ msgstr "处理量化模型中 int8 参数的行为。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"One of `[\"hbm\", \"dram\", \"no\"]`. <br> - `hbm`: copy original int8 "
|
||||
"parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> -"
|
||||
" `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to "
|
||||
"divergence or unpredictable behavior). Default: `\"no\"`."
|
||||
"One of `[\"hbm\", \"dram\", \"no\"]`. <br> - `hbm`: copy original int8 parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> - `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to divergence or unpredictable behavior). Default: `\"no\"`."
|
||||
msgstr "取值为 `[\"hbm\", \"dram\", \"no\"]` 之一。<br> - `hbm`:将原始 int8 参数复制到高带宽内存 (HBM)(可能消耗大量 HBM)。<br> - `dram`:复制到 DRAM。<br> - `no`:不进行特殊处理(可能导致分歧或不可预测的行为)。默认值:`\"no\"`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -242,8 +215,7 @@ msgstr "在服务器模式下,用于写入每个 rank 监听器地址/端口
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"If set, each rank writes to `{OUTPUT_PREFIX}{RANK}.txt` (text), content ="
|
||||
" `IP:Port`."
|
||||
"If set, each rank writes to `{OUTPUT_PREFIX}{RANK}.txt` (text), content = `IP:Port`."
|
||||
msgstr "如果设置,每个 rank 将写入 `{OUTPUT_PREFIX}{RANK}.txt`(文本文件),内容为 `IP:Port`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -256,8 +228,7 @@ msgstr "指定上述配置的 JSON 文件路径。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"If provided, the SOURCE inside this file has **first priority** "
|
||||
"(overrides SOURCE in other configs)."
|
||||
"If provided, the SOURCE inside this file has **first priority** (overrides SOURCE in other configs)."
|
||||
msgstr "如果提供,此文件内的 SOURCE 具有 **最高优先级**(覆盖其他配置中的 SOURCE)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:50
|
||||
@@ -294,14 +265,12 @@ msgstr "`<port>`:服务器上的基础监听端口"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:85
|
||||
msgid ""
|
||||
"`<server_IP>` + `<server_Port>`: IP and port of the Netloader server "
|
||||
"(from server log)"
|
||||
"`<server_IP>` + `<server_Port>`: IP and port of the Netloader server (from server log)"
|
||||
msgstr "`<server_IP>` + `<server_Port>`:Netloader 服务器的 IP 和端口(来自服务器日志)"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:86
|
||||
msgid ""
|
||||
"`<device_id_diff_from_server>`: Client device ID (must differ from "
|
||||
"server’s)"
|
||||
"`<device_id_diff_from_server>`: Client device ID (must differ from server’s)"
|
||||
msgstr "`<device_id_diff_from_server>`:客户端设备 ID(必须与服务器的不同)"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:87
|
||||
@@ -310,8 +279,7 @@ msgstr "`<client_port>`:客户端监听的端口"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:89
|
||||
msgid ""
|
||||
"After startup, you can test consistency by issuing inference requests "
|
||||
"with temperature = 0 and comparing outputs."
|
||||
"After startup, you can test consistency by issuing inference requests with temperature = 0 and comparing outputs."
|
||||
msgstr "启动后,您可以通过发送 temperature = 0 的推理请求并比较输出来测试一致性。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:93
|
||||
@@ -320,22 +288,15 @@ msgstr "注意事项与限制"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:95
|
||||
msgid ""
|
||||
"If Netloader is used, **each worker process** must bind a listening port."
|
||||
" That port may be user-specified or assigned randomly. If user-specified,"
|
||||
" ensure it is available."
|
||||
"If Netloader is used, **each worker process** must bind a listening port. That port may be user-specified or assigned randomly. If user-specified, ensure it is available."
|
||||
msgstr "如果使用 Netloader,**每个工作进程** 都必须绑定一个监听端口。该端口可以是用户指定的,也可以是随机分配的。如果是用户指定的,请确保其可用。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:96
|
||||
msgid ""
|
||||
"Netloader requires extra HBM memory to establish HCCL connections (i.e. "
|
||||
"`HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient "
|
||||
"capacity (e.g. via `--gpu-memory-utilization`)."
|
||||
msgstr "Netloader 需要额外的 HBM 内存来建立 HCCL 连接(即 `HCCL_BUFFERSIZE`,默认约 200 MB)。用户应预留足够的容量(例如通过 `--gpu-memory-utilization`)。"
|
||||
"Netloader requires extra on-chip memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`)."
|
||||
msgstr "Netloader 需要额外的片上内存来建立 HCCL 连接(即 `HCCL_BUFFERSIZE`,默认约 200 MB)。用户应预留足够的容量(例如通过 `--gpu-memory-utilization`)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:97
|
||||
msgid ""
|
||||
"It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or"
|
||||
" slow connections/transmissions. Related info: [vLLM Issue "
|
||||
"#16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR "
|
||||
"#16226](https://github.com/vllm-project/vllm/pull/16226)."
|
||||
"It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or slow connections/transmissions. Related info: [vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226)."
|
||||
msgstr "建议设置 `VLLM_SLEEP_WHEN_IDLE=1` 以缓解不稳定或缓慢的连接/传输。相关信息:[vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226)。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -33,11 +33,12 @@ msgid ""
|
||||
"ascend/issues/4715), this is a simple ACLGraph graph mode acceleration "
|
||||
"solution based on Fx graphs."
|
||||
msgstr ""
|
||||
"如 [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715) 中所述,这是一个基于 Fx 图的简单 ACLGraph 图模式加速解决方案。"
|
||||
"如 [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715) "
|
||||
"中所述,这是一个基于 Fx 图的简单 ACLGraph 图模式加速解决方案。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/npugraph_ex.md:7
|
||||
msgid "Using npugraph_ex"
|
||||
msgstr "使用 npugraph_ex"
|
||||
msgid "Using Npugraph_ex"
|
||||
msgstr "使用 Npugraph_ex"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/npugraph_ex.md:9
|
||||
msgid ""
|
||||
@@ -58,4 +59,6 @@ msgid ""
|
||||
"You can find more details about "
|
||||
"[npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)"
|
||||
msgstr ""
|
||||
"您可以在 [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html) 找到更多详细信息。"
|
||||
"您可以在 "
|
||||
"[npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)"
|
||||
" 找到更多详细信息。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -35,8 +35,7 @@ msgid ""
|
||||
"such as PPO, GRPO, or DPO. During training, the policy model typically "
|
||||
"performs autoregressive generation using inference engines like vLLM, "
|
||||
"followed by forward and backward passes for optimization."
|
||||
msgstr ""
|
||||
"睡眠模式是一个专为从NPU内存中卸载模型权重并丢弃KV缓存而设计的API。此功能对于强化学习(RL)后训练工作负载至关重要,特别是在PPO、GRPO或DPO等在线算法中。在训练期间,策略模型通常使用vLLM等推理引擎执行自回归生成,随后进行前向和反向传播以完成优化。"
|
||||
msgstr "睡眠模式是一个专为从NPU内存中卸载模型权重并丢弃KV缓存而设计的API。此功能对于强化学习(RL)后训练工作负载至关重要,特别是在PPO、GRPO或DPO等在线算法中。在训练期间,策略模型通常使用vLLM等推理引擎执行自回归生成,随后进行前向和反向传播以完成优化。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:7
|
||||
msgid ""
|
||||
@@ -44,8 +43,7 @@ msgid ""
|
||||
"parallelism strategies, it becomes crucial to free KV cache and even "
|
||||
"offload model parameters stored within vLLM during training. This ensures"
|
||||
" efficient memory utilization and avoids resource contention on the NPU."
|
||||
msgstr ""
|
||||
"由于生成阶段和训练阶段可能采用不同的模型并行策略,因此在训练期间释放KV缓存,甚至卸载存储在vLLM中的模型参数变得至关重要。这确保了高效的内存利用,并避免了NPU上的资源争用。"
|
||||
msgstr "由于生成阶段和训练阶段可能采用不同的模型并行策略,因此在训练期间释放KV缓存,甚至卸载存储在vLLM中的模型参数变得至关重要。这确保了高效的内存利用,并避免了NPU上的资源争用。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:9
|
||||
msgid "Getting started"
|
||||
@@ -55,11 +53,13 @@ msgstr "快速入门"
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in"
|
||||
" vllm is under a specific memory pool. During model loading and KV cache "
|
||||
" vLLM is under a specific memory pool. During model loading and KV cache "
|
||||
"initialization, we tag the memory as a map: `{\"weight\": data, "
|
||||
"\"kv_cache\": data}`."
|
||||
msgstr ""
|
||||
"当设置 `enable_sleep_mode=True` 时,我们在vllm中管理内存(分配、释放)的方式将在一个特定的内存池下进行。在模型加载和KV缓存初始化期间,我们将内存标记为一个映射:`{\"weight\": data, \"kv_cache\": data}`。"
|
||||
"当设置 `enable_sleep_mode=True` "
|
||||
"时,我们在vLLM中管理内存(分配、释放)的方式将在一个特定的内存池下进行。在模型加载和KV缓存初始化期间,我们将内存标记为一个映射:`{\"weight\":"
|
||||
" data, \"kv_cache\": data}`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:13
|
||||
msgid ""
|
||||
@@ -115,7 +115,8 @@ msgid ""
|
||||
"`export COMPILE_CUSTOM_KERNELS=1`."
|
||||
msgstr ""
|
||||
"由于此功能使用了底层API "
|
||||
"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html),为了使用睡眠模式,您应遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)并从源码构建。如果您使用的版本低于v0.12.0rc1,请记得设置 `export COMPILE_CUSTOM_KERNELS=1`。"
|
||||
"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html),为了使用睡眠模式,您应遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)并从源码构建。如果您使用的版本低于v0.12.0rc1,请记得设置"
|
||||
" `export COMPILE_CUSTOM_KERNELS=1`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:28
|
||||
msgid "Usage"
|
||||
@@ -139,4 +140,5 @@ msgid ""
|
||||
" are under a dev-mode, and explicitly specify the dev environment "
|
||||
"`VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up)."
|
||||
msgstr ""
|
||||
"考虑到可能存在恶意访问的风险,请确保您处于开发模式,并明确指定开发环境变量 `VLLM_SERVER_DEV_MODE` 以开放这些端点(sleep/wake up)。"
|
||||
"考虑到可能存在恶意访问的风险,请确保您处于开发模式,并明确指定开发环境变量 `VLLM_SERVER_DEV_MODE` "
|
||||
"以开放这些端点(sleep/wake up)。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -32,16 +32,16 @@ msgid ""
|
||||
"Unified Cache Management (UCM) provides an external KV-cache storage "
|
||||
"layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike "
|
||||
"KV Pooling, which expands prefix-cache capacity only by aggregating "
|
||||
"device memory and therefore remains limited by HBM/DRAM size and lacks "
|
||||
"persistence, UCM decouples compute from storage and adopts a tiered "
|
||||
"design. Each node uses local DRAM as a fast cache, while a shared "
|
||||
"device memory and therefore remains limited by on-chip memory/DRAM size "
|
||||
"and lacks persistence, UCM decouples compute from storage and adopts a "
|
||||
"tiered design. Each node uses local DRAM as a fast cache, while a shared "
|
||||
"backend—such as 3FS or enterprise-grade storage—serves as the persistent "
|
||||
"KV store. This approach removes the capacity ceiling imposed by device "
|
||||
"memory, enables durable and reliable prefix caching, and allows cache "
|
||||
"capacity to scale with the storage system rather than with compute "
|
||||
"resources."
|
||||
msgstr ""
|
||||
"统一缓存管理(UCM)为vLLM/vLLM-Ascend中的前缀缓存场景提供了一个外部的KV缓存存储层。与仅通过聚合设备内存来扩展前缀缓存容量、因此仍受限于HBM/DRAM大小且缺乏持久性的KV池化不同,UCM将计算与存储解耦,并采用分层设计。每个节点使用本地DRAM作为快速缓存,而共享后端(如3FS或企业级存储)则作为持久化的KV存储。这种方法消除了设备内存带来的容量上限,实现了持久可靠的前缀缓存,并使缓存容量能够随存储系统而非计算资源扩展。"
|
||||
"统一缓存管理(UCM)为vLLM/vLLM-Ascend中的前缀缓存场景提供了外部KV缓存存储层。与仅通过聚合设备内存扩展前缀缓存容量、因此仍受限于片上内存/DRAM大小且缺乏持久性的KV池化不同,UCM将计算与存储解耦,并采用分层设计。每个节点使用本地DRAM作为快速缓存,而共享后端(如3FS或企业级存储)则作为持久化KV存储。这种方法消除了设备内存带来的容量上限,实现了持久可靠的前缀缓存,并使缓存容量能够随存储系统而非计算资源扩展。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:7
|
||||
msgid "Prerequisites"
|
||||
@@ -73,7 +73,8 @@ msgid ""
|
||||
"NPU](https://ucm.readthedocs.io/en/latest/getting-"
|
||||
"started/quickstart_vllm_ascend.html)**"
|
||||
msgstr ""
|
||||
"**请参考[昇腾NPU的官方UCM安装指南](https://ucm.readthedocs.io/en/latest/getting-started/quickstart_vllm_ascend.html)**"
|
||||
"**请参考[昇腾NPU的官方UCM安装指南](https://ucm.readthedocs.io/en/latest/getting-"
|
||||
"started/quickstart_vllm_ascend.html)**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:18
|
||||
msgid "Configure UCM for Prefix Caching"
|
||||
@@ -96,11 +97,12 @@ msgid ""
|
||||
"documentation for prefix-caching](https://ucm.readthedocs.io/en/latest"
|
||||
"/user-guide/prefix-cache/nfs_store.html)**"
|
||||
msgstr ""
|
||||
"**有关最新的配置选项,请参考[前缀缓存的官方UCM文档](https://ucm.readthedocs.io/en/latest/user-guide/prefix-cache/nfs_store.html)**"
|
||||
"**有关最新的配置选项,请参考[前缀缓存的官方UCM文档](https://ucm.readthedocs.io/en/latest/user-"
|
||||
"guide/prefix-cache/nfs_store.html)**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:27
|
||||
msgid "A minimal configuration looks like this:"
|
||||
msgstr "一个最小配置示例如下:"
|
||||
msgstr "最小配置示例如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:39
|
||||
msgid "Explanation:"
|
||||
@@ -119,7 +121,8 @@ msgid ""
|
||||
" here. **⚠️ Make sure to replace `\"/mnt/test\"` with your actual "
|
||||
"storage directory.**"
|
||||
msgstr ""
|
||||
"storage_backends:指定用于存储KV块的目录。它可以是本地目录或NFS挂载路径。UCM将在此处存储KV块。**⚠️ 请确保将`\"/mnt/test\"`替换为您的实际存储目录。**"
|
||||
"storage_backends:指定用于存储KV块的目录。可以是本地目录或NFS挂载路径。UCM将在此处存储KV块。**⚠️ "
|
||||
"请确保将`\"/mnt/test\"`替换为您的实际存储目录。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:48
|
||||
msgid "use_direct: Whether to enable direct I/O (optional). Default is `false`."
|
||||
@@ -132,7 +135,8 @@ msgid ""
|
||||
"on Ascend, so it must be set to `false` (all ranks load/dump "
|
||||
"independently)."
|
||||
msgstr ""
|
||||
"load_only_first_rank:控制是否仅rank 0加载KV缓存并将其广播到其他rank。此功能目前在昇腾上不受支持,因此必须设置为`false`(所有rank独立加载/转储)。"
|
||||
"load_only_first_rank:控制是否仅rank "
|
||||
"0加载KV缓存并将其广播到其他rank。此功能目前在昇腾上不受支持,因此必须设置为`false`(所有rank独立加载/转储)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:55
|
||||
msgid "Launching Inference"
|
||||
@@ -143,7 +147,7 @@ msgid ""
|
||||
"In this guide, we describe **online inference** using vLLM with the UCM "
|
||||
"connector, deployed as an OpenAI-compatible server. For best performance "
|
||||
"with UCM, it is recommended to set `block_size` to 128."
|
||||
msgstr "在本指南中,我们描述使用带有UCM连接器的vLLM进行**在线推理**,部署为OpenAI兼容的服务器。为了获得UCM的最佳性能,建议将`block_size`设置为128。"
|
||||
msgstr "在本指南中,我们描述使用带有UCM连接器的vLLM进行**在线推理**,部署为OpenAI兼容的服务器。为获得UCM的最佳性能,建议将`block_size`设置为128。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:59
|
||||
msgid "To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:"
|
||||
@@ -154,7 +158,9 @@ msgid ""
|
||||
"**⚠️ Make sure to replace `\"/vllm-workspace/unified-cache-"
|
||||
"management/examples/ucm_config_example.yaml\"` with your actual config "
|
||||
"file path.**"
|
||||
msgstr "**⚠️ 请确保将`\"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml\"`替换为您的实际配置文件路径。**"
|
||||
msgstr ""
|
||||
"**⚠️ 请确保将`\"/vllm-workspace/unified-cache-"
|
||||
"management/examples/ucm_config_example.yaml\"`替换为您的实际配置文件路径。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:81
|
||||
msgid "If you see the log below:"
|
||||
@@ -176,7 +182,9 @@ msgid ""
|
||||
"way to observe the prefix caching effect is to run the built-in `vllm "
|
||||
"bench` CLI. Executing the following command **twice** in a separate "
|
||||
"terminal shows the improvement clearly."
|
||||
msgstr "在启用`UCMConnector`启动vLLM服务器后,观察前缀缓存效果的最简单方法是运行内置的`vllm bench` CLI。在单独的终端中**两次**执行以下命令可以清晰地展示改进效果。"
|
||||
msgstr ""
|
||||
"在启用`UCMConnector`启动vLLM服务器后,观察前缀缓存效果的最简单方法是运行内置的`vllm "
|
||||
"bench` CLI。在单独的终端中**两次**执行以下命令可以清晰地展示改进效果。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:112
|
||||
msgid "After the first execution"
|
||||
@@ -216,4 +224,5 @@ msgid ""
|
||||
"cached prefix significantly reduces the initial latency observed by the "
|
||||
"model, yielding an approximate **8× improvement in TTFT** compared to the"
|
||||
" initial run."
|
||||
msgstr "这表明在第二次请求期间,UCM成功从存储后端检索了全部125个缓存的KV块。利用完全缓存的前缀显著减少了模型观察到的初始延迟,与首次运行相比,TTFT实现了约**8倍的提升**。"
|
||||
msgstr ""
|
||||
"这表明在第二次请求期间,UCM成功从存储后端检索了全部125个缓存的KV块。利用完全缓存的前缀显著减少了模型观察到的初始延迟,与首次运行相比,TTFT实现了约**8倍的提升**。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -35,29 +35,24 @@ msgid ""
|
||||
"L2 cache ahead of time, reducing MTE utilization during the linear layer "
|
||||
"computations and indirectly improving Cube computation efficiency by "
|
||||
"minimizing resource contention and optimizing data flow."
|
||||
msgstr ""
|
||||
"权重预取通过在需要之前将权重预加载到缓存中来优化内存使用,从而最小化模型执行期间因内存访问造成的延迟。线性层有时表现出相对较高的MTE利用率。为了解决这个问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与原始向量计算流水线(如量化、MoE门控top_k、RMSNorm和SwiGlu)并行运行。这种方法允许权重提前预加载到L2缓存中,减少线性层计算期间的MTE利用率,并通过最小化资源争用和优化数据流间接提高Cube计算效率。"
|
||||
msgstr "权重预取通过在需要之前将权重预加载到缓存中来优化内存使用,从而最小化模型执行期间因内存访问造成的延迟。线性层有时表现出相对较高的MTE利用率。为了解决这个问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与原始向量计算流水线(如量化、MoE门控top_k、RMSNorm和SwiGlu)并行运行。这种方法允许权重提前预加载到L2缓存中,减少线性层计算期间的MTE利用率,并通过最小化资源争用和优化数据流间接提高Cube计算效率。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:5
|
||||
msgid ""
|
||||
"Since we use vector computations to hide the weight prefetching pipeline,"
|
||||
" this has an effect on computation. If you prioritize low latency over "
|
||||
"high throughput, it is best not to enable prefetching."
|
||||
msgstr ""
|
||||
"由于我们使用向量计算来隐藏权重预取流水线,这会对计算产生影响。如果您优先考虑低延迟而非高吞吐量,最好不要启用预取。"
|
||||
msgstr "由于我们使用向量计算来隐藏权重预取流水线,这会对计算产生影响。如果您优先考虑低延迟而非高吞吐量,最好不要启用预取。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:7
|
||||
msgid "Quick Start"
|
||||
msgstr "快速开始"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:9
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"With `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` to open weight prefetch."
|
||||
msgstr ""
|
||||
"使用 `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` 来开启权重预取。"
|
||||
"Use `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` to enable weight prefetch."
|
||||
msgstr "使用 `--additional-config '{\"weight_prefetch_config\": {\"enabled\": true}}'` 来开启权重预取。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:11
|
||||
msgid "Fine-tune Prefetch Ratio"
|
||||
@@ -72,25 +67,21 @@ msgid ""
|
||||
"performance degradation. To accommodate different scenarios, we have "
|
||||
"added `prefetch_ratio` to allow for flexible size configuration based on "
|
||||
"the specific workload, details as follows:"
|
||||
msgstr ""
|
||||
"由于权重预取使用向量计算来隐藏权重预取流水线,预取大小的设置至关重要。如果大小太小,则无法充分发挥优化优势;而较大的大小可能导致资源争用,从而导致性能下降。为了适应不同的场景,我们添加了`prefetch_ratio`,允许根据具体工作负载灵活配置大小,详情如下:"
|
||||
msgstr "由于权重预取使用向量计算来隐藏权重预取流水线,预取大小的设置至关重要。如果大小太小,则无法充分发挥优化优势;而较大的大小可能导致资源争用,从而导致性能下降。为了适应不同的场景,我们添加了`prefetch_ratio`,允许根据具体工作负载灵活配置大小,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:15
|
||||
msgid ""
|
||||
"With `prefetch_ratio` in `\"weight_prefetch_config\"` to custom the "
|
||||
"weight prefetch ratio for specific linear layers."
|
||||
msgstr ""
|
||||
"使用`\"weight_prefetch_config\"`中的`prefetch_ratio`来为特定的线性层自定义权重预取比例。"
|
||||
msgstr "使用`\"weight_prefetch_config\"`中的`prefetch_ratio`来为特定的线性层自定义权重预取比例。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:17
|
||||
msgid ""
|
||||
"The “attn” and “moe” configuration options are used for MoE model, "
|
||||
"details as follows:"
|
||||
msgstr ""
|
||||
"“attn”和“moe”配置选项用于MoE模型,详情如下:"
|
||||
msgstr "“attn”和“moe”配置选项用于MoE模型,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:19
|
||||
#, python-brace-format
|
||||
msgid "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
msgstr "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
|
||||
@@ -98,11 +89,9 @@ msgstr "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
msgid ""
|
||||
"The “mlp” configuration option is used to optimize the performance of the"
|
||||
" Dense model, details as follows:"
|
||||
msgstr ""
|
||||
"“mlp”配置选项用于优化Dense模型的性能,详情如下:"
|
||||
msgstr "“mlp”配置选项用于优化Dense模型的性能,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:23
|
||||
#, python-brace-format
|
||||
msgid "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
|
||||
msgstr "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
|
||||
|
||||
@@ -111,8 +100,7 @@ msgid ""
|
||||
"Above value are the default config, the default value has a good "
|
||||
"performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, for "
|
||||
"Qwen3-32B-W8A8 when `--max-num-seqs` is 72."
|
||||
msgstr ""
|
||||
"以上值为默认配置,当`--max-num-seqs`为144时,该默认值对Qwen3-235B-A22B-W8A8有良好性能;当`--max-num-seqs`为72时,对Qwen3-32B-W8A8有良好性能。"
|
||||
msgstr "以上值为默认配置,当`--max-num-seqs`为144时,该默认值对Qwen3-235B-A22B-W8A8有良好性能;当`--max-num-seqs`为72时,对Qwen3-32B-W8A8有良好性能。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:27
|
||||
msgid ""
|
||||
@@ -128,8 +116,7 @@ msgid ""
|
||||
"computation operator. In the profiling timeline, a prefetch operation "
|
||||
"appears as a CMO operation on a single stream; this CMO operation is the "
|
||||
"prefetch operation."
|
||||
msgstr ""
|
||||
"然而,这可能不是您场景下的最优配置。对于更高的并发度,可以尝试增加预取大小。对于较低的并发度,预取可能不会带来任何优势,因此可以减少大小或禁用预取。通过收集性能分析数据来确定预取大小是否合适。具体来说,检查预取操作(例如,MLP Down Proj权重预取)所需的时间是否与并行向量计算算子(例如,SwiGlu计算)所需的时间重叠,以及预取操作是否不晚于向量计算算子的完成时间。在性能分析时间线中,预取操作显示为单个流上的CMO操作;此CMO操作即为预取操作。"
|
||||
msgstr "然而,这可能不是您场景下的最优配置。对于更高的并发度,可以尝试增加预取大小。对于较低的并发度,预取可能不会带来任何优势,因此可以减少大小或禁用预取。通过收集性能分析数据来确定预取大小是否合适。具体来说,检查预取操作(例如,MLP Down Proj权重预取)所需的时间是否与并行向量计算算子(例如,SwiGlu计算)所需的时间重叠,以及预取操作是否不晚于向量计算算子的完成时间。在性能分析时间线中,预取操作显示为单个流上的CMO操作;此CMO操作即为预取操作。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:29
|
||||
msgid "Notes:"
|
||||
@@ -140,16 +127,14 @@ msgid ""
|
||||
"Weight prefetch of MLP `down` project prefetch depends on sequence "
|
||||
"parallel, if you want to open for mlp `down` please also enable sequence "
|
||||
"parallel."
|
||||
msgstr ""
|
||||
"MLP `down`投影的权重预取依赖于序列并行,如果您想为mlp `down`开启预取,请同时启用序列并行。"
|
||||
msgstr "MLP `down`投影的权重预取依赖于序列并行,如果您想为mlp `down`开启预取,请同时启用序列并行。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:32
|
||||
msgid ""
|
||||
"Due to the current size of the L2 cache, the maximum prefetch cannot "
|
||||
"exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 *"
|
||||
" 1024` bytes, the backend will only prefetch 18MB."
|
||||
msgstr ""
|
||||
"由于当前L2缓存的大小,最大预取量不能超过18MB。如果`prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024`字节,后端将只预取18MB。"
|
||||
msgstr "由于当前L2缓存的大小,最大预取量不能超过18MB。如果`prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024`字节,后端将只预取18MB。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:34
|
||||
msgid "Example"
|
||||
@@ -167,5 +152,4 @@ msgstr "对于Dense模型:"
|
||||
msgid ""
|
||||
"Following is the default configuration that can get a good performance "
|
||||
"for `--max-num-seqs` is 72 for Qwen3-32B-W8A8"
|
||||
msgstr ""
|
||||
"以下是默认配置,当`--max-num-seqs`为72时,该配置可为Qwen3-32B-W8A8带来良好性能"
|
||||
msgstr "以下是默认配置,当`--max-num-seqs`为72时,该配置可为Qwen3-32B-W8A8带来良好性能"
|
||||
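The weight_prefetch.md strings in the last file of this diff state a sizing rule: the backend prefetches `prefetch_ratio * linear_layer_weight_size` bytes, capped at 18MB. The sketch below illustrates that rule and builds a matching `--additional-config` value. The helper names are hypothetical and are not part of vLLM Ascend. Placing `prefetch_ratio` directly under `weight_prefetch_config` is an assumption based on the msgid text, not something this diff confirms.

```python
import json

# Cap stated in weight_prefetch.md: at most 18MB is prefetched per linear layer.
L2_PREFETCH_CAP_BYTES = 18 * 1024 * 1024


def effective_prefetch_bytes(prefetch_ratio: float, weight_size_bytes: int) -> int:
    """Bytes prefetched for one linear layer after the 18MB cap (hypothetical helper)."""
    requested = int(prefetch_ratio * weight_size_bytes)
    return min(requested, L2_PREFETCH_CAP_BYTES)


def build_additional_config(prefetch_ratio: dict) -> str:
    """Serialize a value for --additional-config (assumed nesting, see the note above)."""
    return json.dumps(
        {"weight_prefetch_config": {"enabled": True, "prefetch_ratio": prefetch_ratio}}
    )


if __name__ == "__main__":
    # A 100 MiB weight at ratio 0.8 requests 80 MiB, so the cap applies.
    print(effective_prefetch_bytes(0.8, 100 * 1024 * 1024))
    # Default ratios quoted in the msgid for Dense models.
    print(build_additional_config({"mlp": {"gate_up": 1.0, "down": 1.0}}))
```

When the requested size already exceeds the cap, raising `prefetch_ratio` further has no effect under this rule, so tuning only matters for layers below the 18MB threshold.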