[v0.18.0][Doc] Translated Doc files 2026-04-15 (#8309)
## Auto-Translation Summary

Translated **19** file(s):

- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/contributors.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/ModelRunner_prepare_inputs.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Kimi-K2.5.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen2.5-Omni.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Dense.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/epd_disaggregation.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/external_dp.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/large_scale_ep.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/release_notes.po</code>

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24447109402)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend\n"
|
"Project-Id-Version: vllm-ascend\n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -32,7 +32,7 @@ msgid "Name"
|
|||||||
msgstr "姓名"
|
msgstr "姓名"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "Github ID"
|
msgid "GitHub ID"
|
||||||
msgstr "GitHub ID"
|
msgstr "GitHub ID"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
@@ -917,6 +917,14 @@ msgstr "306"
|
|||||||
msgid "[@mengchengTang](https://github.com/mengchengTang)"
|
msgid "[@mengchengTang](https://github.com/mengchengTang)"
|
||||||
msgstr "[@mengchengTang](https://github.com/mengchengTang)"
|
msgstr "[@mengchengTang](https://github.com/mengchengTang)"
|
||||||
|
|
||||||
|
#: ../../source/community/contributors.md
|
||||||
|
msgid ""
|
||||||
|
"[41eb71d](https://github.com/vllm-project/vllm-"
|
||||||
|
"ascend/commit/41eb71d665ab9f0b72b6d3bc15d41dee7fcc0f5f)"
|
||||||
|
msgstr ""
|
||||||
|
"[41eb71d](https://github.com/vllm-project/vllm-"
|
||||||
|
"ascend/commit/41eb71d665ab9f0b72b6d3bc15d41dee7fcc0f5f)"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "305"
|
msgid "305"
|
||||||
msgstr "305"
|
msgstr "305"
|
||||||
@@ -2611,7 +2619,7 @@ msgstr "[@wangxiaochao6](https://github.com/wangxiaochao6)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/11/18"
|
msgid "2025/11/18"
|
||||||
msgstr "2025/11/18"
|
msgstr "2025年11月18日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -2631,7 +2639,7 @@ msgstr "[@845473182](https://github.com/845473182)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/11/14"
|
msgid "2025/11/14"
|
||||||
msgstr "2025/11/14"
|
msgstr "2025年11月14日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -2651,7 +2659,7 @@ msgstr "[@thonean](https://github.com/thonean)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/11/12"
|
msgid "2025/11/12"
|
||||||
msgstr "2025/11/12"
|
msgstr "2025年11月12日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -2671,7 +2679,7 @@ msgstr "[@zhaomingyu13](https://github.com/zhaomingyu13)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/11/11"
|
msgid "2025/11/11"
|
||||||
msgstr "2025/11/11"
|
msgstr "2025年11月11日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3043,7 +3051,7 @@ msgstr "[@yzy1996](https://github.com/yzy1996)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/23"
|
msgid "2025/10/23"
|
||||||
msgstr "2025/10/23"
|
msgstr "2025年10月23日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3111,7 +3119,7 @@ msgstr "[@KyrieDrewWang](https://github.com/KyrieDrewWang)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/22"
|
msgid "2025/10/22"
|
||||||
msgstr "2025/10/22"
|
msgstr "2025年10月22日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3147,7 +3155,7 @@ msgstr "[@drslark](https://github.com/drslark)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/21"
|
msgid "2025/10/21"
|
||||||
msgstr "2025/10/21"
|
msgstr "2025年10月21日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3183,7 +3191,7 @@ msgstr "[@leijie-ww](https://github.com/leijie-ww)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/20"
|
msgid "2025/10/20"
|
||||||
msgstr "2025/10/20"
|
msgstr "2025年10月20日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3203,7 +3211,7 @@ msgstr "[@ZYang6263](https://github.com/ZYang6263)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/19"
|
msgid "2025/10/19"
|
||||||
msgstr "2025/10/19"
|
msgstr "2025年10月19日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3223,7 +3231,7 @@ msgstr "[@yechao237](https://github.com/yechao237)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/18"
|
msgid "2025/10/18"
|
||||||
msgstr "2025/10/18"
|
msgstr "2025年10月18日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3259,7 +3267,7 @@ msgstr "[@DreamerLeader](https://github.com/DreamerLeader)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/15"
|
msgid "2025/10/15"
|
||||||
msgstr "2025/10/15"
|
msgstr "2025年10月15日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3279,7 +3287,7 @@ msgstr "[@yuzhup](https://github.com/yuzhup)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/14"
|
msgid "2025/10/14"
|
||||||
msgstr "2025/10/14"
|
msgstr "2025年10月14日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -3331,7 +3339,7 @@ msgstr "[@dsxsteven](https://github.com/dsxsteven)"
|
|||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid "2025/10/13"
|
msgid "2025/10/13"
|
||||||
msgstr "2025/10/13"
|
msgstr "2025年10月13日"
|
||||||
|
|
||||||
#: ../../source/community/contributors.md
|
#: ../../source/community/contributors.md
|
||||||
msgid ""
|
msgid ""
|
||||||
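Review aid for the `contributors.po` hunks above: every date entry changes its `msgstr` from the untranslated `YYYY/MM/DD` form to `YYYY年MM月DD日`. The sketch below is one way a reviewer might spot-check that pattern. It is not how the translation workflow produced these strings. The helper names `localize_date` and `mismatched_dates` are hypothetical, and the sketch assumes zero-padded digits are kept as-is, which matches every date shown in this diff.

```python
import re

# Matches the date shape used by the msgid entries in contributors.po.
_DATE = re.compile(r"(\d{4})/(\d{2})/(\d{2})")


def localize_date(msgid: str) -> str:
    """Return the zh_CN form of a YYYY/MM/DD date, or the input unchanged."""
    match = _DATE.fullmatch(msgid)
    if match is None:
        return msgid
    year, month, day = match.groups()
    return f"{year}年{month}月{day}日"


def mismatched_dates(entries):
    """Yield (msgid, msgstr) pairs whose date msgstr is not localized."""
    for msgid, msgstr in entries:
        if _DATE.fullmatch(msgid) and msgstr != localize_date(msgid):
            yield msgid, msgstr


if __name__ == "__main__":
    sample = [
        ("2025/11/18", "2025年11月18日"),  # new msgstr from this commit
        ("2025/11/14", "2025/11/14"),      # old, untranslated msgstr
        ("305", "305"),                    # not a date, ignored
    ]
    for msgid, msgstr in mismatched_dates(sample):
        print(f"untranslated date: {msgid!r} -> {msgstr!r}")
```

Non-date entries such as contributor counts pass through untouched, so the check can be run over a whole catalog without filtering first.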
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend\n"
|
"Project-Id-Version: vllm-ascend\n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -72,8 +72,9 @@ msgid ""
|
|||||||
"(`v[major].[minor].[micro]`). Any post version must be published as a "
|
"(`v[major].[minor].[micro]`). Any post version must be published as a "
|
||||||
"patch version of the final release."
|
"patch version of the final release."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**后续版本**:通常**按需发布**,用于解决正式版本中的小错误。与 [PEP-440 后续版本说明](https://peps.python.org/pep-"
|
"**后续版本**:通常**按需发布**,用于解决正式版本中的小错误。与 [PEP-440 "
|
||||||
"0440/#post-releases) 的惯例不同,这些版本包含实际的错误修复,因为正式发布版本必须严格与 vLLM 的正式发布格式 "
|
"后续版本说明](https://peps.python.org/pep-0440/#post-releases) "
|
||||||
|
"的惯例不同,这些版本包含实际的错误修复,因为正式发布版本必须严格与 vLLM 的正式发布格式 "
|
||||||
"(`v[major].[minor].[micro]`) 对齐。任何后续版本都必须作为正式版本的补丁版本发布。"
|
"(`v[major].[minor].[micro]`) 对齐。任何后续版本都必须作为正式版本的补丁版本发布。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:14
|
#: ../../source/community/versioning_policy.md:14
|
||||||
@@ -379,14 +380,17 @@ msgstr "v0.7.3"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"If you're using v0.7.3, don't forget to install [mindie-"
|
"If you're using v0.7.3, don't forget to install [mindie-"
|
||||||
"turbo](https://pypi.org/project/mindie-turbo) as well."
|
"turbo](https://pypi.org/project/mindie-turbo) as well."
|
||||||
msgstr "如果您正在使用 v0.7.3,请别忘了同时安装 [mindie-turbo](https://pypi.org/project/mindie-turbo)。"
|
msgstr ""
|
||||||
|
"如果您正在使用 v0.7.3,请别忘了同时安装 [mindie-turbo](https://pypi.org/project/mindie-turbo)。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:58
|
#: ../../source/community/versioning_policy.md:58
|
||||||
msgid ""
|
msgid ""
|
||||||
"For main branch of vLLM Ascend, we usually make it compatible with the "
|
"For main branch of vLLM Ascend, we usually make it compatible with the "
|
||||||
"latest vLLM release and a newer commit hash of vLLM. Please note that "
|
"latest vLLM release and a newer commit hash of vLLM. Please note that "
|
||||||
"this table is usually updated. Please check it regularly."
|
"this table is usually updated. Please check it regularly."
|
||||||
msgstr "对于 vLLM Ascend 的 main 分支,我们通常会使其与最新的 vLLM 发布版本以及更新的 vLLM 提交哈希兼容。请注意,此表格会经常更新,请定期查看。"
|
msgstr ""
|
||||||
|
"对于 vLLM Ascend 的 main 分支,我们通常会使其与最新的 vLLM 发布版本以及更新的 vLLM "
|
||||||
|
"提交哈希兼容。请注意,此表格会经常更新,请定期查看。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "main"
|
msgid "main"
|
||||||
@@ -683,7 +687,9 @@ msgid ""
|
|||||||
"**releases/vX.Y.Z**: development branch, created with part of new "
|
"**releases/vX.Y.Z**: development branch, created with part of new "
|
||||||
"releases of vLLM. For example, `releases/v0.13.0` is the dev branch for "
|
"releases of vLLM. For example, `releases/v0.13.0` is the dev branch for "
|
||||||
"vLLM `v0.13.0` version."
|
"vLLM `v0.13.0` version."
|
||||||
msgstr "**releases/vX.Y.Z**:开发分支,随 vLLM 新版本的一部分创建。例如,`releases/v0.13.0` 是 vLLM `v0.13.0` 版本的开发分支。"
|
msgstr ""
|
||||||
|
"**releases/vX.Y.Z**:开发分支,随 vLLM 新版本的一部分创建。例如,`releases/v0.13.0` 是 vLLM "
|
||||||
|
"`v0.13.0` 版本的开发分支。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:109
|
#: ../../source/community/versioning_policy.md:109
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -760,7 +766,10 @@ msgid ""
|
|||||||
"do not (e.g. `releases/v0.12.0`). The vLLM Ascend release branch now "
|
"do not (e.g. `releases/v0.12.0`). The vLLM Ascend release branch now "
|
||||||
"follows the `releases/vX.Y.Z` naming convention, replacing the previous "
|
"follows the `releases/vX.Y.Z` naming convention, replacing the previous "
|
||||||
"`vX.Y.Z-dev` format to align with vLLM's branch naming standards."
|
"`vX.Y.Z-dev` format to align with vLLM's branch naming standards."
|
||||||
msgstr "请注意,vLLM Ascend 仅针对特定的 vLLM 发布版本进行发布,而非每个版本。因此,您可能会注意到某些版本有对应的开发分支(例如 `releases/v0.13.0`),而其他版本则没有(例如 `releases/v0.12.0`)。vLLM Ascend 的发布分支现在遵循 `releases/vX.Y.Z` 命名约定,取代了之前的 `vX.Y.Z-dev` 格式,以与 vLLM 的分支命名标准保持一致。"
|
msgstr ""
|
||||||
|
"请注意,vLLM Ascend 仅针对特定的 vLLM 发布版本进行发布,而非每个版本。因此,您可能会注意到某些版本有对应的开发分支(例如 "
|
||||||
|
"`releases/v0.13.0`),而其他版本则没有(例如 `releases/v0.12.0`)。vLLM Ascend 的发布分支现在遵循"
|
||||||
|
" `releases/vX.Y.Z` 命名约定,取代了之前的 `vX.Y.Z-dev` 格式,以与 vLLM 的分支命名标准保持一致。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:125
|
#: ../../source/community/versioning_policy.md:125
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -910,7 +919,10 @@ msgid ""
|
|||||||
" indicate that they have installed a dev or editable version of vLLM "
|
" indicate that they have installed a dev or editable version of vLLM "
|
||||||
"package. In this case, we provide the env variable `VLLM_VERSION` to let "
|
"package. In this case, we provide the env variable `VLLM_VERSION` to let "
|
||||||
"users specify the version of vLLM package to use."
|
"users specify the version of vLLM package to use."
|
||||||
msgstr "为确保代码更改与最新的 1 或 2 个 vLLM 发布版本兼容,vLLM Ascend 在代码中引入了版本检查机制。它首先检查已安装的 vLLM 包的版本,以决定使用哪段代码逻辑。如果用户遇到 `InvalidVersion` 错误,可能表明他们安装了开发版或可编辑版本的 vLLM 包。在这种情况下,我们提供了环境变量 `VLLM_VERSION`,允许用户指定要使用的 vLLM 包版本。"
|
msgstr ""
|
||||||
|
"为确保代码更改与最新的 1 或 2 个 vLLM 发布版本兼容,vLLM Ascend 在代码中引入了版本检查机制。它首先检查已安装的 vLLM "
|
||||||
|
"包的版本,以决定使用哪段代码逻辑。如果用户遇到 `InvalidVersion` 错误,可能表明他们安装了开发版或可编辑版本的 vLLM "
|
||||||
|
"包。在这种情况下,我们提供了环境变量 `VLLM_VERSION`,允许用户指定要使用的 vLLM 包版本。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:154
|
#: ../../source/community/versioning_policy.md:154
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -929,7 +941,10 @@ msgid ""
|
|||||||
"variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-"
|
"variables in [docs/source/conf.py](https://github.com/vllm-project/vllm-"
|
||||||
"ascend/blob/main/docs/source/conf.py)**. While this is not a simple task,"
|
"ascend/blob/main/docs/source/conf.py)**. While this is not a simple task,"
|
||||||
" it is a principle we should strive to follow."
|
" it is a principle we should strive to follow."
|
||||||
msgstr "为降低维护成本,**所有分支的文档内容应保持一致,版本差异可通过 [docs/source/conf.py](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/conf.py) 中的变量进行控制**。虽然这并非易事,但这是我们应努力遵循的原则。"
|
msgstr ""
|
||||||
|
"为降低维护成本,**所有分支的文档内容应保持一致,版本差异可通过 [docs/source/conf.py](https://github.com"
|
||||||
|
"/vllm-project/vllm-ascend/blob/main/docs/source/conf.py) "
|
||||||
|
"中的变量进行控制**。虽然这并非易事,但这是我们应努力遵循的原则。"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "Version"
|
msgid "Version"
|
||||||
@@ -945,7 +960,7 @@ msgstr "代码分支"
|
|||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "latest"
|
msgid "latest"
|
||||||
msgstr "最新"
|
msgstr "latest"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "Doc for the latest rc release of main branch"
|
msgid "Doc for the latest rc release of main branch"
|
||||||
@@ -957,7 +972,7 @@ msgstr "`main` 分支"
|
|||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "rc version"
|
msgid "rc version"
|
||||||
msgstr "候选版本"
|
msgstr "rc version"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "Doc for RC released versions"
|
msgid "Doc for RC released versions"
|
||||||
@@ -969,7 +984,7 @@ msgstr "`vX.Y.ZrcN` --> `vX.Y.ZrcN` 标签"
|
|||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "version"
|
msgid "version"
|
||||||
msgstr "版本"
|
msgstr "version"
|
||||||
|
|
||||||
#: ../../source/community/versioning_policy.md:54
|
#: ../../source/community/versioning_policy.md:54
|
||||||
msgid "Doc for historical released versions"
|
msgid "Doc for historical released versions"
|
||||||
@@ -1004,77 +1019,14 @@ msgstr "软件依赖管理"
|
|||||||
#: ../../source/community/versioning_policy.md:174
|
#: ../../source/community/versioning_policy.md:174
|
||||||
msgid ""
|
msgid ""
|
||||||
"`torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable "
|
"`torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable "
|
||||||
"version to [PyPi](https://pypi.org/project/torch-npu) every 3 months, a "
|
"version to [PyPI](https://pypi.org/project/torch-npu) every 3 months, a "
|
||||||
"development version (aka the POC version) every month, and a nightly "
|
"development version (aka the POC version) every month, and a nightly "
|
||||||
"version every day. The PyPi stable version **CAN** be used in vLLM Ascend"
|
"version every day. The PyPI stable version **CAN** be used in vLLM Ascend"
|
||||||
" final version, the monthly dev version **ONLY CAN** be used in vLLM "
|
" final version, the monthly dev version **ONLY CAN** be used in vLLM "
|
||||||
"Ascend RC version for rapid iteration, and the nightly version **CANNOT**"
|
"Ascend RC version for rapid iteration, and the nightly version **CANNOT**"
|
||||||
" be used in vLLM Ascend any version or branch."
|
" be used in vLLM Ascend any version or branch."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`torch-npu`:Ascend Extension for PyTorch(torch-npu)每 3 个月在 "
|
"`torch-npu`:Ascend Extension for PyTorch(torch-npu)每 3 个月在 "
|
||||||
"[PyPi](https://pypi.org/project/torch-npu) 发布一个稳定版本,每月发布一个开发版本(亦称 POC 版本),每日发布一个 "
|
"[PyPI](https://pypi.org/project/torch-npu) 发布一个稳定版本,每月发布一个开发版本(亦称 POC "
|
||||||
"nightly 版本。PyPi 稳定版本**可以**用于 vLLM Ascend 正式版,月度开发版本**仅能**用于 vLLM Ascend RC "
|
"版本),每日发布一个 nightly 版本。PyPI 稳定版本**可以**用于 vLLM Ascend 正式版,月度开发版本**仅能**用于 "
|
||||||
"版本以进行快速迭代,nightly 版本**不能**用于 vLLM Ascend 的任何版本或分支。"
|
"vLLM Ascend RC 版本以进行快速迭代,nightly 版本**不能**用于 vLLM Ascend 的任何版本或分支。"
|
||||||
|
|
||||||
#~ msgid "MindIE Turbo"
|
|
||||||
#~ msgstr "MindIE Turbo"
|
|
||||||
|
|
||||||
#~ msgid "2.0rc1"
|
|
||||||
#~ msgstr "2.0候选版本1"
|
|
||||||
|
|
||||||
#~ msgid "The branch status will be in one of the following states:"
|
|
||||||
#~ msgstr "分支状态将处于以下几种状态之一:"
|
|
||||||
|
|
||||||
#~ msgid ""
|
|
||||||
#~ "Note that vLLM Ascend will only be"
|
|
||||||
#~ " released for a certain vLLM release"
|
|
||||||
#~ " version rather than all versions. "
|
|
||||||
#~ "Hence, You might see only part of"
|
|
||||||
#~ " versions have dev branches (such as"
|
|
||||||
#~ " only `0.7.1-dev` / `0.7.3-dev` but "
|
|
||||||
#~ "no `0.7.2-dev`), this is as expected."
|
|
||||||
#~ msgstr ""
|
|
||||||
#~ "请注意,vLLM Ascend 仅会针对特定的 vLLM "
|
|
||||||
#~ "发布版本进行发布,而非所有版本。因此,您可能只会看到部分版本拥有开发分支(例如仅有 `0.7.1-dev` / `0.7.3-dev`,而没有 "
|
|
||||||
#~ "`0.7.2-dev`),这是正常现象。"
|
|
||||||
|
|
||||||
#~ msgid "Doc for the latest dev branch"
|
|
||||||
#~ msgstr "最新开发分支的文档"
|
|
||||||
|
|
||||||
#~ msgid "vX.Y.Z-dev (Will be `main` after the first final release)"
|
|
||||||
#~ msgstr "vX.Y.Z-dev(在首次正式发布后将成为 `main`)"
|
|
||||||
|
|
||||||
#~ msgid "Git tags, like vX.Y.Z[rcN]"
|
|
||||||
#~ msgstr "Git 标签,如 vX.Y.Z[rcN]"
|
|
||||||
|
|
||||||
#~ msgid "stable(not yet released)"
|
|
||||||
#~ msgstr "稳定版(尚未发布)"
|
|
||||||
|
|
||||||
#~ msgid "Will be `vX.Y.Z-dev` after the first official release"
|
|
||||||
#~ msgstr "首次正式发布后将会是 `vX.Y.Z-dev`"
|
|
||||||
|
|
||||||
#~ msgid "As shown above:"
|
|
||||||
#~ msgstr "如上所示:"
|
|
||||||
|
|
||||||
#~ msgid ""
|
|
||||||
#~ "`latest` documentation: Matches the current"
|
|
||||||
#~ " maintenance branch `vX.Y.Z-dev` (Will be"
|
|
||||||
#~ " `main` after the first final "
|
|
||||||
#~ "release). Continuously updated to ensure "
|
|
||||||
#~ "usability for the latest release."
|
|
||||||
#~ msgstr "`latest` 文档:匹配当前维护分支 `vX.Y.Z-dev`(在首次正式发布后将成为 `main`)。持续更新以确保适用于最新发布版本。"
|
|
||||||
|
|
||||||
#~ msgid ""
|
|
||||||
#~ "`stable` documentation (**not yet released**):"
|
|
||||||
#~ " Official release documentation. Updates "
|
|
||||||
#~ "are allowed in real-time after "
|
|
||||||
#~ "release, typically based on vX.Y.Z-dev. "
|
|
||||||
#~ "Once stable documentation is available, "
|
|
||||||
#~ "non-stable versions should display a "
|
|
||||||
#~ "header warning: `You are viewing the "
|
|
||||||
#~ "latest developer preview docs. Click "
|
|
||||||
#~ "here to view docs for the latest"
|
|
||||||
#~ " stable release.`."
|
|
||||||
#~ msgstr ""
|
|
||||||
#~ "`stable` 文档(**尚未发布**):官方发布版文档。发布后允许实时更新,通常基于 "
|
|
||||||
#~ "vX.Y.Z-dev。一旦稳定版文档可用,非稳定版本应显示一个顶部警告:`您正在查看最新的开发预览文档。点击此处查看最新稳定版本文档。`"
|
|
||||||
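Illustration for the version-check entry in the `versioning_policy.po` hunks above, which says the installed vLLM version selects the code path and that `VLLM_VERSION` can override it when a dev or editable install triggers `InvalidVersion`. The sketch below is a minimal standard-library illustration of that idea, not vLLM Ascend's actual implementation. The `InvalidVersion` class here is a local stand-in, `parse_release` and `effective_vllm_version` are hypothetical names, and the parser is deliberately stricter than a real PEP 440 parser: it accepts only `X.Y.Z`.

```python
import os
import re


class InvalidVersion(ValueError):
    """Local stand-in for the error named in the text (assumption)."""


_RELEASE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_release(version: str) -> tuple:
    """Parse 'X.Y.Z' (optionally prefixed with 'v') into a tuple of ints."""
    match = _RELEASE.fullmatch(version.strip())
    if match is None:
        raise InvalidVersion(f"not a release version: {version!r}")
    return tuple(int(part) for part in match.groups())


def effective_vllm_version(installed: str, env=None) -> tuple:
    """Prefer the VLLM_VERSION override, else the installed version string."""
    env = os.environ if env is None else env
    override = env.get("VLLM_VERSION")
    return parse_release(override if override else installed)


if __name__ == "__main__":
    # A dev build string fails to parse unless the user pins a version.
    dev_build = "0.13.1.dev5+gabc123"
    try:
        effective_vllm_version(dev_build, env={})
    except InvalidVersion as exc:
        print("without override:", exc)
    pinned = effective_vllm_version(dev_build, env={"VLLM_VERSION": "0.13.0"})
    print("with override:", pinned)
```

Returning a tuple keeps the branch selection a plain comparison, for example `effective_vllm_version(v) >= (0, 13, 0)`.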
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -46,32 +46,42 @@ msgid ""
|
|||||||
"including HBM, DRAM, and SSD, making a pool for KV Cache storage while "
|
"including HBM, DRAM, and SSD, making a pool for KV Cache storage while "
|
||||||
"making the prefix of requests visible across all nodes, increasing the "
|
"making the prefix of requests visible across all nodes, increasing the "
|
||||||
"cache hit rate for all requests."
|
"cache hit rate for all requests."
|
||||||
msgstr "因此,我们提出了 KV 缓存池,旨在利用包括 HBM、DRAM 和 SSD 在内的多种存储类型,构建一个 KV 缓存存储池,同时使请求的前缀在所有节点间可见,从而提高所有请求的缓存命中率。"
|
msgstr ""
|
||||||
|
"因此,我们提出了 KV 缓存池,旨在利用包括 HBM、DRAM 和 SSD 在内的多种存储类型,构建一个 KV "
|
||||||
|
"缓存存储池,同时使请求的前缀在所有节点间可见,从而提高所有请求的缓存命中率。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:11
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:11
|
||||||
msgid ""
|
msgid ""
|
||||||
"vLLM Ascend currently supports [MooncakeStore](https://github.com"
|
"vLLM Ascend currently supports [MooncakeStore](https://github.com"
|
||||||
"/kvcache-ai/Mooncake), one of the most recognized KV Cache storage "
|
"/kvcache-ai/Mooncake), one of the most recognized KV Cache storage "
|
||||||
"engines."
|
"engines."
|
||||||
msgstr "vLLM Ascend 目前支持 [MooncakeStore](https://github.com/kvcache-ai/Mooncake),这是最受认可的 KV 缓存存储引擎之一。"
|
msgstr ""
|
||||||
|
"vLLM Ascend 目前支持 [MooncakeStore](https://github.com/kvcache-"
|
||||||
|
"ai/Mooncake),这是最受认可的 KV 缓存存储引擎之一。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:13
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:13
|
||||||
msgid ""
|
msgid ""
|
||||||
"While one can utilize Mooncake Store in vLLM V1 engine by setting it as a"
|
"While one can utilize MooncakeStore in vLLM V1 engine by setting it as a "
|
||||||
" remote backend of LMCache with GPU (see "
|
"remote backend of LMCache with GPU (see "
|
||||||
"[Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),"
|
"[Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),"
|
||||||
" we find it would be better to integrate a connector that directly "
|
" we find it would be better to integrate a connector that directly "
|
||||||
"supports Mooncake Store and can utilize the data transfer strategy that "
|
"supports MooncakeStore and can utilize the data transfer strategy that "
|
||||||
"best fits Huawei NPU hardware."
|
"best fits Huawei NPU hardware."
|
||||||
msgstr "虽然可以通过将 Mooncake Store 设置为 GPU 上 LMCache 的远程后端来在 vLLM V1 引擎中使用它(参见[教程](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),但我们认为集成一个直接支持 Mooncake Store 并能利用最适合华为 NPU 硬件的数据传输策略的连接器会更好。"
|
msgstr ""
|
||||||
|
"虽然可以通过将 MooncakeStore 设置为 GPU 上 LMCache 的远程后端来在 vLLM V1 "
|
||||||
|
"引擎中使用它(参见[教程](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)),但我们认为集成一个直接支持"
|
||||||
|
" MooncakeStore 并能利用最适合华为 NPU 硬件的数据传输策略的连接器会更好。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:15
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:15
|
||||||
msgid ""
|
msgid ""
|
||||||
"Hence, we propose to integrate Mooncake Store with a brand new "
|
"Hence, we propose to integrate MooncakeStore with a brand new "
|
||||||
"**MooncakeStoreConnectorV1**, which is indeed largely inspired by "
|
"**MooncakeStoreConnectorV1**, which is indeed largely inspired by "
|
||||||
"**LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 "
|
"**LMCacheConnectorV1** (see the `How is MooncakeStoreConnectorV1 "
|
||||||
"Implemented?` section)."
|
"Implemented?` section)."
|
||||||
msgstr "因此,我们提议将 Mooncake Store 与全新的 **MooncakeStoreConnectorV1** 集成,该连接器的设计在很大程度上受到了 **LMCacheConnectorV1** 的启发(参见 `MooncakeStoreConnectorV1 是如何实现的?` 部分)。"
|
msgstr ""
|
||||||
|
"因此,我们提议将 MooncakeStore 与全新的 **MooncakeStoreConnectorV1** "
|
||||||
|
"集成,该连接器的设计在很大程度上受到了 **LMCacheConnectorV1** 的启发(参见 "
|
||||||
|
"`MooncakeStoreConnectorV1 是如何实现的?` 部分)。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:17
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:17
|
||||||
msgid "Usage"
|
msgid "Usage"
|
||||||
@@ -79,17 +89,21 @@ msgstr "使用方法"
|
|||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:19
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:19
|
||||||
msgid ""
|
msgid ""
|
||||||
"vLLM Ascend currently supports Mooncake Store for KV Cache Pool. To "
|
"vLLM Ascend currently supports MooncakeStore for KV Cache Pool. To enable"
|
||||||
"enable Mooncake Store, one needs to configure `kv-transfer-config` and "
|
" MooncakeStore, one needs to configure `kv-transfer-config` and choose "
|
||||||
"choose `MooncakeStoreConnector` as the KV Connector."
|
"`MooncakeStoreConnector` as the KV Connector."
|
||||||
msgstr "vLLM Ascend 目前支持使用 Mooncake Store 作为 KV 缓存池。要启用 Mooncake Store,需要配置 `kv-transfer-config` 并选择 `MooncakeStoreConnector` 作为 KV 连接器。"
|
msgstr ""
|
||||||
|
"vLLM Ascend 目前支持使用 MooncakeStore 作为 KV 缓存池。要启用 MooncakeStore,需要配置 `kv-"
|
||||||
|
"transfer-config` 并选择 `MooncakeStoreConnector` 作为 KV 连接器。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:21
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:21
|
||||||
msgid ""
|
msgid ""
|
||||||
"For step-by-step deployment and configuration, please refer to the [KV "
|
"For step-by-step deployment and configuration, please refer to the [KV "
|
||||||
"Pool User "
|
"Pool User "
|
||||||
"Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)."
|
"Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)."
|
||||||
msgstr "关于逐步部署和配置,请参考 [KV 池用户指南](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)。"
|
msgstr ""
|
||||||
|
"关于逐步部署和配置,请参考 [KV "
|
||||||
|
"池用户指南](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html)。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:23
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:23
|
||||||
msgid "How it works?"
|
msgid "How it works?"
|
||||||
@@ -114,7 +128,9 @@ msgid ""
|
|||||||
"efficient caching both locally (in HBM) and globally (via Mooncake), "
|
"efficient caching both locally (in HBM) and globally (via Mooncake), "
|
||||||
"ensuring that frequently used prefixes remain hot while less frequently "
|
"ensuring that frequently used prefixes remain hot while less frequently "
|
||||||
"accessed KV data can spill over to lower-cost memory."
|
"accessed KV data can spill over to lower-cost memory."
|
||||||
msgstr "当与 vLLM 的前缀缓存机制结合时,该池能够实现本地(HBM 中)和全局(通过 Mooncake)的高效缓存,确保常用前缀保持热状态,而访问频率较低的 KV 数据则可以溢出到成本更低的内存中。"
|
msgstr ""
|
||||||
|
"当与 vLLM 的前缀缓存机制结合时,该池能够实现本地(HBM 中)和全局(通过 "
|
||||||
|
"Mooncake)的高效缓存,确保常用前缀保持热状态,而访问频率较低的 KV 数据则可以溢出到成本更低的内存中。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:31
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:31
|
||||||
msgid "1. Combining KV Cache Pool with HBM Prefix Caching"
|
msgid "1. Combining KV Cache Pool with HBM Prefix Caching"
|
||||||
@@ -125,7 +141,9 @@ msgid ""
|
|||||||
"Prefix Caching with HBM is already supported by the vLLM V1 Engine. By "
|
"Prefix Caching with HBM is already supported by the vLLM V1 Engine. By "
|
||||||
"introducing KV Connector V1, users can seamlessly combine HBM-based "
|
"introducing KV Connector V1, users can seamlessly combine HBM-based "
|
||||||
"Prefix Caching with Mooncake-backed KV Pool."
|
"Prefix Caching with Mooncake-backed KV Pool."
|
||||||
msgstr "vLLM V1 引擎已支持基于 HBM 的前缀缓存。通过引入 KV Connector V1,用户可以无缝地将基于 HBM 的前缀缓存与 Mooncake 支持的 KV 池结合起来。"
|
msgstr ""
|
||||||
|
"vLLM V1 引擎已支持基于 HBM 的前缀缓存。通过引入 KV Connector V1,用户可以无缝地将基于 HBM 的前缀缓存与 "
|
||||||
|
"Mooncake 支持的 KV 池结合起来。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:36
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:36
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -133,7 +151,9 @@ msgid ""
|
|||||||
"which is enabled by default in vLLM V1 unless the "
|
"which is enabled by default in vLLM V1 unless the "
|
||||||
"`--no_enable_prefix_caching` flag is set, and setting up the KV Connector"
|
"`--no_enable_prefix_caching` flag is set, and setting up the KV Connector"
|
||||||
" for KV Pool (e.g., the MooncakeStoreConnector)."
|
" for KV Pool (e.g., the MooncakeStoreConnector)."
|
||||||
msgstr "用户只需启用前缀缓存(在 vLLM V1 中默认启用,除非设置了 `--no_enable_prefix_caching` 标志)并为 KV 池设置 KV 连接器(例如 MooncakeStoreConnector),即可同时启用这两个功能。"
|
msgstr ""
|
||||||
|
"用户只需启用前缀缓存(在 vLLM V1 中默认启用,除非设置了 `--no_enable_prefix_caching` 标志)并为 KV "
|
||||||
|
"池设置 KV 连接器(例如 MooncakeStoreConnector),即可同时启用这两个功能。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:38
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:38
|
||||||
msgid "**Workflow**:"
|
msgid "**Workflow**:"
|
||||||
@@ -149,7 +169,9 @@ msgid ""
|
|||||||
" the connector. If there are additional hits in the KV Pool, we get the "
|
" the connector. If there are additional hits in the KV Pool, we get the "
|
||||||
"**additional blocks only** from the KV Pool, and get the rest of the "
|
"**additional blocks only** from the KV Pool, and get the rest of the "
|
||||||
"blocks directly from HBM to minimize the data transfer latency."
|
"blocks directly from HBM to minimize the data transfer latency."
|
||||||
msgstr "获取 HBM 上的命中令牌数量后,引擎通过连接器查询 KV 池。如果在 KV 池中有额外的命中,我们**仅从 KV 池获取额外的块**,其余块则直接从 HBM 获取,以最小化数据传输延迟。"
|
msgstr ""
|
||||||
|
"获取 HBM 上的命中令牌数量后,引擎通过连接器查询 KV 池。如果在 KV 池中有额外的命中,我们**仅从 KV "
|
||||||
|
"池获取额外的块**,其余块则直接从 HBM 获取,以最小化数据传输延迟。"
|
||||||
|
|
||||||
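The lookup order described in this entry (HBM first, then only the additional blocks from the KV Pool) can be sketched as follows. This is a minimal illustration, not the connector's actual code: `plan_kv_loads` and its parameters are hypothetical names, and both hit counts are assumed to be prefix lengths measured in tokens.

```python
def plan_kv_loads(num_prompt_tokens, hbm_hit_tokens, pool_hit_tokens, block_size):
    """Split a prompt into HBM-served, pool-served, and recomputed token ranges.

    Hypothetical helper: both hit counts are assumed to be prefix lengths.
    """
    # Hits are reused in whole blocks, so round down to a block boundary.
    hbm_hit = (hbm_hit_tokens // block_size) * block_size
    pool_hit = (pool_hit_tokens // block_size) * block_size
    # Only the part of the pool hit that extends past the HBM hit is fetched.
    extra_from_pool = max(0, pool_hit - hbm_hit)
    cached = hbm_hit + extra_from_pool
    return {
        "from_hbm": (0, hbm_hit),                  # already resident, no transfer
        "from_pool": (hbm_hit, cached),            # additional blocks only
        "recompute": (cached, num_prompt_tokens),  # no cache hit, run prefill
    }


plan = plan_kv_loads(num_prompt_tokens=100, hbm_hit_tokens=32,
                     pool_hit_tokens=64, block_size=16)
print(plan)
```

Here 32 tokens stay in HBM, tokens 32 to 64 are fetched from the pool, and the remaining 36 tokens are prefilled normally.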
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:44
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:44
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -173,7 +195,9 @@ msgid ""
|
|||||||
"Currently, we only perform put and get operations of KV Pool for "
|
"Currently, we only perform put and get operations of KV Pool for "
|
||||||
"**Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P "
|
"**Prefill Nodes**, and Decode Nodes get their KV Cache from Mooncake P2P "
|
||||||
"KV Connector, i.e., MooncakeConnector."
|
"KV Connector, i.e., MooncakeConnector."
|
||||||
msgstr "目前,我们仅对**预填充节点**执行 KV 池的 put 和 get 操作,解码节点则通过 Mooncake P2P KV 连接器(即 MooncakeConnector)获取其 KV 缓存。"
|
msgstr ""
|
||||||
|
"目前,我们仅对**预填充节点**执行 KV 池的 put 和 get 操作,解码节点则通过 Mooncake P2P KV 连接器(即 "
|
||||||
|
"MooncakeConnector)获取其 KV 缓存。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:52
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:52
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -182,15 +206,20 @@ msgid ""
|
|||||||
"Nodes, while not sacrificing the data transfer efficiency between Prefill"
|
"Nodes, while not sacrificing the data transfer efficiency between Prefill"
|
||||||
" and Decode nodes with P2P KV Connector that transfers KV Caches between "
|
" and Decode nodes with P2P KV Connector that transfers KV Caches between "
|
||||||
"NPU devices directly."
|
"NPU devices directly."
|
||||||
msgstr "这样做的主要好处是,我们可以通过为预填充节点使用来自 HBM 和 KV 池的前缀缓存来减少计算量,从而保持性能增益,同时又不牺牲预填充节点与解码节点之间的数据传输效率,因为 P2P KV 连接器直接在 NPU 设备间传输 KV 缓存。"
|
msgstr ""
|
||||||
|
"这样做的主要好处是,我们可以通过为预填充节点使用来自 HBM 和 KV "
|
||||||
|
"池的前缀缓存来减少计算量,从而保持性能增益,同时又不牺牲预填充节点与解码节点之间的数据传输效率,因为 P2P KV 连接器直接在 NPU "
|
||||||
|
"设备间传输 KV 缓存。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:54
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:54
|
||||||
msgid ""
|
msgid ""
|
||||||
"To enable this feature, we need to set up both Mooncake Connector and "
|
"To enable this feature, we need to set up both Mooncake Connector and "
|
||||||
"Mooncake Store Connector with a Multi Connector, which is a KV Connector "
|
"MooncakeStore Connector with a Multi Connector, which is a KV Connector "
|
||||||
"class provided by vLLM that can call multiple KV Connectors in a specific"
|
"class provided by vLLM that can call multiple KV Connectors in a specific"
|
||||||
" order."
|
" order."
|
||||||
msgstr "要启用此功能,我们需要使用 Multi Connector 来设置 Mooncake Connector 和 Mooncake Store Connector。Multi Connector 是 vLLM 提供的一个 KV 连接器类,可以按特定顺序调用多个 KV 连接器。"
|
msgstr ""
|
||||||
|
"要启用此功能,我们需要使用 Multi Connector 来设置 Mooncake Connector 和 MooncakeStore "
|
||||||
|
"Connector。Multi Connector 是 vLLM 提供的一个 KV 连接器类,可以按特定顺序调用多个 KV 连接器。"
|
||||||
|
|
||||||
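As a rough sketch of what such a combined setup might look like on a Prefill node, the snippet below builds a `kv_transfer_config` value as JSON. The nested `connectors` list is assumed to follow vLLM's Multi Connector layout; the sub-connector names are copied from the text and the `kv_role` value is an assumption, so the registered names and any extra fields should be checked against the deployment guide for the release in use.

```python
import json

# Assumption: connector names and roles below are illustrative placeholders.
kv_transfer_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_producer",  # assumed role for a Prefill node
    "kv_connector_extra_config": {
        # Connectors are listed in the order the Multi Connector calls them.
        "connectors": [
            {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"},
            {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_producer"},
        ]
    },
}

# The serialized string is what would be passed as the KV transfer config.
config_json = json.dumps(kv_transfer_config)
print(config_json)
```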
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:56
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:56
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -208,7 +237,9 @@ msgid ""
|
|||||||
"V1: through implementing the required methods defined in the KV connector"
|
"V1: through implementing the required methods defined in the KV connector"
|
||||||
" V1 base class, one can integrate a third-party KV cache transfer/storage"
|
" V1 base class, one can integrate a third-party KV cache transfer/storage"
|
||||||
" backend into the vLLM framework."
|
" backend into the vLLM framework."
|
||||||
msgstr "**MooncakeStoreConnectorV1** 继承自 vLLM V1 中的 KV Connector V1 类:通过实现 KV 连接器 V1 基类中定义的必要方法,可以将第三方 KV 缓存传输/存储后端集成到 vLLM 框架中。"
|
msgstr ""
|
||||||
|
"**MooncakeStoreConnectorV1** 继承自 vLLM V1 中的 KV Connector V1 类:通过实现 KV 连接器"
|
||||||
|
" V1 基类中定义的必要方法,可以将第三方 KV 缓存传输/存储后端集成到 vLLM 框架中。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:62
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:62
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -220,7 +251,12 @@ msgid ""
|
|||||||
"that allows async `get` and `put` of KV caches with multi-threading, and "
|
"that allows async `get` and `put` of KV caches with multi-threading, and "
|
||||||
"NPU-related data transfer optimization such as removing the `LocalBuffer`"
|
"NPU-related data transfer optimization such as removing the `LocalBuffer`"
|
||||||
" in LMCache to remove redundant data transfer."
|
" in LMCache to remove redundant data transfer."
|
||||||
msgstr "MooncakeStoreConnectorV1 也在很大程度上借鉴了 LMCacheConnectorV1,包括用于查找 KV 缓存键的 `Lookup Engine`/`Lookup Client` 设计,以及用于将令牌处理为前缀感知哈希的 `ChunkedTokenDatabase` 类和其他哈希相关设计。在此基础上,我们还添加了自己的设计,包括允许通过多线程异步 `get` 和 `put` KV 缓存的 `KVTransferThread`,以及与 NPU 相关的数据传输优化,例如移除 LMCache 中的 `LocalBuffer` 以消除冗余数据传输。"
|
msgstr ""
|
||||||
|
"MooncakeStoreConnectorV1 也在很大程度上借鉴了 LMCacheConnectorV1,包括用于查找 KV 缓存键的 "
|
||||||
|
"`Lookup Engine`/`Lookup Client` 设计,以及用于将令牌处理为前缀感知哈希的 "
|
||||||
|
"`ChunkedTokenDatabase` 类和其他哈希相关设计。在此基础上,我们还添加了自己的设计,包括允许通过多线程异步 `get` 和 "
|
||||||
|
"`put` KV 缓存的 `KVTransferThread`,以及与 NPU 相关的数据传输优化,例如移除 LMCache 中的 "
|
||||||
|
"`LocalBuffer` 以消除冗余数据传输。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:64
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:64
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -268,7 +304,8 @@ msgstr ""
|
|||||||
"`wait_for_layer_load`:可选;在分层 + 异步 KV 加载场景中等待层加载。\n"
|
"`wait_for_layer_load`:可选;在分层 + 异步 KV 加载场景中等待层加载。\n"
|
||||||
"`save_kv_layer`:可选;执行分层 KV 缓存放入 KV 池的操作。\n"
|
"`save_kv_layer`:可选;执行分层 KV 缓存放入 KV 池的操作。\n"
|
||||||
"`wait_for_save`:如果异步保存/放入 KV 缓存,则等待 KV 保存完成。\n"
|
"`wait_for_save`:如果异步保存/放入 KV 缓存,则等待 KV 保存完成。\n"
|
||||||
"`get_finished`:获取已完成 KV 传输的请求,如果 `put` 完成则为 `done_sending`,如果 `get` 完成则为 `done_receiving`。"
|
"`get_finished`:获取已完成 KV 传输的请求,如果 `put` 完成则为 `done_sending`,如果 `get` 完成则为 "
|
||||||
|
"`done_receiving`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:82
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:82
|
||||||
msgid "DFX"
|
msgid "DFX"
|
||||||
@@ -293,9 +330,9 @@ msgstr "限制"
|
|||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:89
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:89
|
||||||
msgid ""
|
msgid ""
|
||||||
"Currently, Mooncake Store for vLLM-Ascend only supports DRAM as the "
|
"Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the "
|
||||||
"storage for KV Cache pool."
|
"storage for KV Cache pool."
|
||||||
msgstr "目前,vLLM-Ascend 的 Mooncake Store 仅支持 DRAM 作为 KV 缓存池的存储。"
|
msgstr "目前,vLLM-Ascend 的 MooncakeStore 仅支持 DRAM 作为 KV 缓存池的存储。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:91
|
#: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:91
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -306,4 +343,6 @@ msgid ""
|
|||||||
"situation by rolling the request back and recomputing everything, assuming "
|
"situation by rolling the request back and recomputing everything, assuming "
|
||||||
"there's no prefix cache hit (or even better, revert only one block and "
|
"there's no prefix cache hit (or even better, revert only one block and "
|
||||||
"keep using the Prefix Caches before that)."
|
"keep using the Prefix Caches before that)."
|
||||||
msgstr "目前,如果我们成功查找到一个键并发现它存在,但在调用 KV 池的 get 函数时失败,我们仅输出一条日志表明 get 操作失败并继续执行;因此,该特定请求的准确性可能会受到影响。我们将通过回退请求并假设没有前缀缓存命中来重新计算所有内容(或者更好的是,仅回退一个块并继续使用该块之前的前缀缓存)来处理这种情况。"
|
msgstr ""
|
||||||
|
"目前,如果我们成功查找到一个键并发现其存在,但在调用 KV 池的 get 函数时获取失败,我们仅输出一条日志表明 get "
|
||||||
|
"操作失败并继续执行;因此,该特定请求的准确性可能会受到影响。我们将通过回退请求并假设没有前缀缓存命中来重新计算所有内容(或者更优的方案是,仅回退一个块并继续使用该块之前的前缀缓存)来处理这种情况。"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -88,13 +88,15 @@ msgid ""
|
|||||||
"At last, these `Token IDs` are required to be fed into a model, and "
|
"At last, these `Token IDs` are required to be fed into a model, and "
|
||||||
"`positions` should also be sent into the model to create `Rope` (Rotary "
|
"`positions` should also be sent into the model to create `Rope` (Rotary "
|
||||||
"positional embedding). Both of them are the inputs of the model."
|
"positional embedding). Both of them are the inputs of the model."
|
||||||
msgstr "最后,这些 `Token IDs` 需要输入到模型中,`positions` 也需要送入模型以创建 `Rope`(旋转位置编码)。两者共同构成模型的输入。"
|
msgstr ""
|
||||||
|
"最后,这些 `Token IDs` 需要输入到模型中,`positions` 也需要送入模型以创建 "
|
||||||
|
"`Rope`(旋转位置编码)。两者共同构成模型的输入。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:38
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:38
|
||||||
msgid ""
|
msgid ""
|
||||||
"**Note**: The `Token IDs` are the inputs of a model, so we also call them"
|
"**Note**: The `Token IDs` are the inputs of a model, so we also call them"
|
||||||
" `Inputs IDs`."
|
" `Input IDs`."
|
||||||
msgstr "**注意**:`Token IDs` 是模型的输入,因此我们也称它们为 `Inputs IDs`。"
|
msgstr "**注意**:`Token IDs` 是模型的输入,因此我们也称它们为 `Input IDs`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:40
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:40
|
||||||
msgid "2. Build inputs attention metadata"
|
msgid "2. Build inputs attention metadata"
|
||||||
@@ -185,14 +187,19 @@ msgid ""
|
|||||||
"len)`. Here, `max num request` is the maximum count of concurrent "
|
"len)`. Here, `max num request` is the maximum count of concurrent "
|
||||||
"requests allowed in a forward batch and `max model len` is the maximum "
|
"requests allowed in a forward batch and `max model len` is the maximum "
|
||||||
"token count that can be handled at one request sequence in this model."
|
"token count that can be handled at one request sequence in this model."
|
||||||
msgstr "**Token IDs table**:存储每个请求的 token IDs(即模型的输入)。此表的形状为 `(max num request, max model len)`。其中,`max num request` 是前向批次中允许的最大并发请求数,`max model len` 是该模型中单个请求序列可以处理的最大 token 数量。"
|
msgstr ""
|
||||||
|
"**Token IDs table**:存储每个请求的 token IDs(即模型的输入)。此表的形状为 `(max num request, "
|
||||||
|
"max model len)`。其中,`max num request` 是前向批次中允许的最大并发请求数,`max model len` "
|
||||||
|
"是该模型中单个请求序列可以处理的最大 token 数量。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:62
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:62
|
||||||
msgid ""
|
msgid ""
|
||||||
"**Block table**: translates the logical address (within its sequence) of "
|
"**Block table**: translates the logical address (within its sequence) of "
|
||||||
"each block to its global physical address in the device's memory. The "
|
"each block to its global physical address in the device's memory. The "
|
||||||
"shape of this table is `(max num request, max model len / block size)`"
|
"shape of this table is `(max num request, max model len / block size)`"
|
||||||
msgstr "**Block table**:将每个块在其序列内的逻辑地址转换为其在设备内存中的全局物理地址。此表的形状为 `(max num request, max model len / block size)`"
|
msgstr ""
|
||||||
|
"**Block table**:将每个块在其序列内的逻辑地址转换为其在设备内存中的全局物理地址。此表的形状为 `(max num request,"
|
||||||
|
" max model len / block size)`"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:64
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:64
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -255,13 +262,14 @@ msgid "Obtain inputs"
|
|||||||
msgstr "获取输入"
|
msgstr "获取输入"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:103
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:103
|
||||||
#, python-brace-format
|
|
||||||
msgid ""
|
msgid ""
|
||||||
"As the maximum number of tokens that can be scheduled is 10, the "
|
"As the maximum number of tokens that can be scheduled is 10, the "
|
||||||
"scheduled tokens of each request can be represented as `{'0': 3, '1': 2, "
|
"scheduled tokens of each request can be represented as `{'0': 3, '1': 2, "
|
||||||
"'2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt "
|
"'2': 5}`. Note that `request_2` uses chunked prefill, leaving 3 prompt "
|
||||||
"tokens unscheduled."
|
"tokens unscheduled."
|
||||||
msgstr "由于一次可调度的最大 token 数为 10,每个请求的已调度 token 可以表示为 `{'0': 3, '1': 2, '2': 5}`。注意 `request_2` 使用了分块预填充,留下了 3 个提示 token 未调度。"
|
msgstr ""
|
||||||
|
"由于一次可调度的最大 token 数为 10,每个请求的已调度 token 可以表示为 `{'0': 3, '1': 2, '2': 5}`。注意"
|
||||||
|
" `request_2` 使用了分块预填充,留下了 3 个提示 token 未调度。"
|
||||||
|
|
||||||
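The token budget in this example can be reproduced with a small sketch. The function name `schedule_tokens` is hypothetical, and the prompt lengths (3, 2, and 8) are inferred from the text: `request_2` has 5 tokens scheduled and 3 left over. Real scheduling involves more policy than this greedy pass.

```python
def schedule_tokens(remaining_prompt_tokens, token_budget):
    """Greedily hand out a per-step token budget in request order."""
    scheduled = {}
    for req_id, remaining in remaining_prompt_tokens.items():
        if token_budget == 0:
            break
        # A request that does not fit is chunked: it gets what is left.
        num = min(remaining, token_budget)
        scheduled[req_id] = num
        token_budget -= num
    return scheduled


remaining = {"0": 3, "1": 2, "2": 8}  # inferred prompt lengths
scheduled = schedule_tokens(remaining, token_budget=10)
unscheduled = {r: remaining[r] - scheduled.get(r, 0) for r in remaining}
print(scheduled)    # {'0': 3, '1': 2, '2': 5}
print(unscheduled)  # {'0': 0, '1': 0, '2': 3}
```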
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:105
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:105
|
||||||
msgid "1. Get token positions"
|
msgid "1. Get token positions"
|
||||||
@@ -273,7 +281,10 @@ msgid ""
|
|||||||
"assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to"
|
"assigned to **request_0**, tokens 3–4 to **request_1**, and tokens 5–9 to"
|
||||||
" **request_2**. To represent this mapping, we use `request indices`, for "
|
" **request_2**. To represent this mapping, we use `request indices`, for "
|
||||||
"example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`."
|
"example, `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`."
|
||||||
msgstr "首先,确定每个 token 属于哪个请求:token 0–2 分配给 **request_0**,token 3–4 分配给 **request_1**,token 5–9 分配给 **request_2**。为了表示这种映射,我们使用 `request indices`,例如,`request indices`:`[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`。"
|
msgstr ""
|
||||||
|
"首先,确定每个 token 属于哪个请求:token 0–2 分配给 **request_0**,token 3–4 分配给 "
|
||||||
|
"**request_1**,token 5–9 分配给 **request_2**。为了表示这种映射,我们使用 `request "
|
||||||
|
"indices`,例如,`request indices`:`[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:109
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:109
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -281,7 +292,10 @@ msgid ""
|
|||||||
"position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + "
|
"position of current scheduled tokens** (`request_0: [0 + 0, 0 + 1, 0 + "
|
||||||
"2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`)"
|
"2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`)"
|
||||||
" and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)."
|
" and then concatenate them together (`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)."
|
||||||
msgstr "对于每个请求,使用 **已计算 token 的数量** + **当前调度 token 的相对位置**(`request_0: [0 + 0, 0 + 1, 0 + 2]`,`request_1: [0 + 0, 0 + 1]`,`request_2: [0 + 0, 0 + 1,..., 0 + 4]`),然后将它们连接在一起(`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)。"
|
msgstr ""
|
||||||
|
"对于每个请求,使用 **已计算 token 的数量** + **当前调度 token 的相对位置**(`request_0: [0 + 0, 0 "
|
||||||
|
"+ 1, 0 + 2]`,`request_1: [0 + 0, 0 + 1]`,`request_2: [0 + 0, 0 + 1,..., 0"
|
||||||
|
" + 4]`),然后将它们连接在一起(`[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`)。"
|
||||||
|
|
||||||
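The rule "number of computed tokens + relative position" can be written out directly. This is a plain-Python sketch with a hypothetical helper name; the second call assumes a follow-up step in which the requests have 3, 2, and 5 computed tokens and schedule 1, 1, and 3 new tokens.

```python
def token_positions(num_computed, num_scheduled):
    """Concatenate per-request positions: computed count + relative offset."""
    positions = []
    for computed, scheduled in zip(num_computed, num_scheduled):
        positions.extend(computed + rel for rel in range(scheduled))
    return positions


# First step: nothing computed yet, 3 / 2 / 5 tokens scheduled.
step1 = token_positions([0, 0, 0], [3, 2, 5])
print(step1)  # [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# Assumed follow-up step: decode for request_0 and request_1,
# remaining chunked prefill for request_2.
step2 = token_positions([3, 2, 5], [1, 1, 3])
print(step2)  # [3, 2, 5, 6, 7]
```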
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:111
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:111
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -293,7 +307,9 @@ msgstr "注意:在实际代码中,有一种更高效的方法(使用 `requ
|
|||||||
msgid ""
|
msgid ""
|
||||||
"Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, "
|
"Finally, `token positions` can be obtained as `[0, 1, 2, 0, 1, 0, 1, 2, "
|
||||||
"3, 4]`. This variable is **token level**."
|
"3, 4]`. This variable is **token level**."
|
||||||
msgstr "最后,`token positions` 可以获取为 `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`。此变量是 **token 级别** 的。"
|
msgstr ""
|
||||||
|
"最后,`token positions` 可以获取为 `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`。此变量是 **token "
|
||||||
|
"级别** 的。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:115
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:115
|
||||||
msgid "2. Get token indices"
|
msgid "2. Get token indices"
|
||||||
@@ -326,14 +342,19 @@ msgstr "注意 `T_x_x` 是一个 `int32`。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"Let's say `M = max model len`. Then we can use `token positions` together"
|
"Let's say `M = max model len`. Then we can use `token positions` together"
|
||||||
" with `request indices` of each token to construct `token indices`."
|
" with `request indices` of each token to construct `token indices`."
|
||||||
msgstr "假设 `M = max model len`。那么我们可以使用 `token positions` 以及每个 token 的 `request indices` 来构造 `token indices`。"
|
msgstr ""
|
||||||
|
"假设 `M = max model len`。那么我们可以使用 `token positions` 以及每个 token 的 `request "
|
||||||
|
"indices` 来构造 `token indices`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:137
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:137
|
||||||
msgid ""
|
msgid ""
|
||||||
"So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 "
|
"So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 "
|
||||||
"* M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2,"
|
"* M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2,"
|
||||||
" 12, 13, 24, 25, 26, 27, 28]`"
|
" 12, 13, 24, 25, 26, 27, 28]`"
|
||||||
msgstr "所以 `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`"
|
msgstr ""
|
||||||
|
"所以 `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 "
|
||||||
|
"* M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2,"
|
||||||
|
" 12, 13, 24, 25, 26, 27, 28]`"
|
||||||
|
|
||||||
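The arithmetic above amounts to indexing a flattened `(max num request, max model len)` table. The sketch below uses `M = 12` as in the example and fills the table with placeholder strings `T_<request>_<position>` purely for illustration; the actual table holds integer token IDs.

```python
M = 12                 # max model len used in the example
scheduled = [3, 2, 5]  # scheduled token count per request

# Token-level request index: request r repeated scheduled[r] times.
request_indices = [r for r, n in enumerate(scheduled) for _ in range(n)]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# Row-major offset into the flattened table: position + request * M.
token_indices = [p + r * M for p, r in zip(positions, request_indices)]
print(token_indices)  # [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]

# Placeholder table, flattened row by row, then gathered by token index.
table = [[f"T_{r}_{c}" for c in range(M)] for r in range(len(scheduled))]
flat = [tok for row in table for tok in row]
input_ids = [flat[i] for i in token_indices]
print(input_ids)
```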
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:139
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:139
|
||||||
msgid "3. Retrieve the Token IDs"
|
msgid "3. Retrieve the Token IDs"
|
||||||
@@ -353,7 +374,9 @@ msgstr "如前所述,我们将这些 `Token IDs` 称为 `Input IDs`。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, "
|
"`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, "
|
||||||
"T_2_3, T_2_4]`"
|
"T_2_3, T_2_4]`"
|
||||||
msgstr "`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, T_2_3, T_2_4]`"
|
msgstr ""
|
||||||
|
"`Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_2_2, "
|
||||||
|
"T_2_3, T_2_4]`"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:151
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:151
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:237
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:237
|
||||||
@@ -367,7 +390,8 @@ msgid ""
|
|||||||
"model len / block size)`, where `max model len / block size = 12 / 2 = "
|
"model len / block size)`, where `max model len / block size = 12 / 2 = "
|
||||||
"6`."
|
"6`."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"在当前的**块表**中,我们使用第一个块(即 block_0)来标记未使用的块。块的形状为 `(最大请求数, 最大模型长度 / 块大小)`,其中 `最大模型长度 / 块大小 = 12 / 2 = 6`。"
|
"在当前的**块表**中,我们使用第一个块(即 block_0)来标记未使用的块。块的形状为 `(最大请求数, 最大模型长度 / 块大小)`,其中 "
|
||||||
|
"`最大模型长度 / 块大小 = 12 / 2 = 6`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:165
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:165
|
||||||
msgid "The KV cache block in the device memory is like:"
|
msgid "The KV cache block in the device memory is like:"
|
||||||
@@ -434,7 +458,11 @@ msgid ""
|
|||||||
" / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to "
|
" / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`. This could be used to "
|
||||||
"select `device block number` from `block table`."
|
"select `device block number` from `block table`."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"(**令牌级别**) 使用一个简单的公式计算`块表索引`:`request indices * K + positions / block size`。因此它等于 `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2, 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / 2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]`。这可用于从`块表`中选择`设备块编号`。"
|
"(**令牌级别**) 使用一个简单的公式计算`块表索引`:`request indices * K + positions / block "
|
||||||
|
"size`。因此它等于 `[0 * 6 + 0 / 2, 0 * 6 + 1 / 2, 0 * 6 + 2 / 2, 1 * 6 + 0 / 2,"
|
||||||
|
" 1 * 6 + 1 / 2, 2 * 6 + 0 / 2, 2 * 6 + 1 / 2, 2 * 6 + 2 / 2, 2 * 6 + 3 / "
|
||||||
|
"2, 2 * 6 + 4 / 2] = [0, 0, 1, 6, 6, 12, 12, 13, 13, "
|
||||||
|
"14]`。这可用于从`块表`中选择`设备块编号`。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:194
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:194
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -443,14 +471,17 @@ msgid ""
|
|||||||
"block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, "
|
"block_table[block_table_indices]`. So `device block number=[1, 1, 2, 3, "
|
||||||
"3, 4, 4, 5, 5, 6]`"
|
"3, 4, 4, 5, 5, 6]`"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"(**令牌级别**) 使用`块表索引`为每个已调度的令牌选择出`设备块编号`。伪代码为 `block_numbers = block_table[block_table_indices]`。因此 `设备块编号=[1, 1, 2, 3, 3, 4, 4, 5, 5, 6]`"
|
"(**令牌级别**) 使用`块表索引`为每个已调度的令牌选择出`设备块编号`。伪代码为 `block_numbers = "
|
||||||
|
"block_table[block_table_indices]`。因此 `设备块编号=[1, 1, 2, 3, 3, 4, 4, 5, 5, "
|
||||||
|
"6]`"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:195
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:195
|
||||||
msgid ""
|
msgid ""
|
||||||
"(**Token level**) `block offsets` could be computed by `block offsets = "
|
"(**Token level**) `block offsets` could be computed by `block offsets = "
|
||||||
"positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`."
|
"positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]`."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"(**令牌级别**) `块内偏移`可以通过 `block offsets = positions % block size = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]` 计算得出。"
|
"(**令牌级别**) `块内偏移`可以通过 `block offsets = positions % block size = [0, 1, 0,"
|
||||||
|
" 0, 1, 0, 1, 0, 1, 0]` 计算得出。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:196
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:196
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -458,7 +489,8 @@ msgid ""
|
|||||||
"mapping`: `device block number * block size + block_offsets = [2, 3, 4, "
|
"mapping`: `device block number * block size + block_offsets = [2, 3, 4, "
|
||||||
"6, 7, 8, 9, 10, 11, 12]`"
|
"6, 7, 8, 9, 10, 11, 12]`"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"最后,使用`块内偏移`和`设备块编号`创建`槽映射`:`设备块编号 * 块大小 + 块内偏移 = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`"
|
"最后,使用`块内偏移`和`设备块编号`创建`槽映射`:`设备块编号 * 块大小 + 块内偏移 = [2, 3, 4, 6, 7, 8, 9, "
|
||||||
|
"10, 11, 12]`"
|
||||||
|
|
||||||
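The three token-level steps (block table indices, device block numbers, block offsets) and the final slot mapping can be checked end to end with the sketch below. The block table contents are inferred from the device block numbers quoted in the text, with `0` marking unused entries; variable names are illustrative.

```python
block_size = 2
K = 6  # blocks per request: max model len / block size = 12 / 2

# Inferred block table: one row per request, 0 marks an unused entry.
block_table = [
    [1, 2, 0, 0, 0, 0],  # request_0
    [3, 0, 0, 0, 0, 0],  # request_1
    [4, 5, 6, 0, 0, 0],  # request_2
]
flat_block_table = [b for row in block_table for b in row]

request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# request index * K + position // block size (integer division).
block_table_indices = [r * K + p // block_size
                       for r, p in zip(request_indices, positions)]
block_numbers = [flat_block_table[i] for i in block_table_indices]
block_offsets = [p % block_size for p in positions]
slot_mapping = [n * block_size + o
                for n, o in zip(block_numbers, block_offsets)]

print(block_table_indices)  # [0, 0, 1, 6, 6, 12, 12, 13, 13, 14]
print(block_numbers)        # [1, 1, 2, 3, 3, 4, 4, 5, 5, 6]
print(block_offsets)        # [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
print(slot_mapping)         # [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```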
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:198
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:198
|
||||||
msgid "(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:"
|
msgid "(**Request level**) As we know the scheduled token count is `[3, 2, 5]`:"
|
||||||
@@ -538,7 +570,9 @@ msgid ""
|
|||||||
"**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and "
|
"**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and "
|
||||||
"**request_1** respectively. They are sampled from the output of the "
|
"**request_1** respectively. They are sampled from the output of the "
|
||||||
"model."
|
"model."
|
||||||
msgstr "**注意**:**T_0_3**、**T_1_2** 分别是 **request_0** 和 **request_1** 的新令牌 ID。它们是从模型输出中采样得到的。"
|
msgstr ""
|
||||||
|
"**注意**:**T_0_3**、**T_1_2** 分别是 **request_0** 和 **request_1** 的新令牌 "
|
||||||
|
"ID。它们是从模型输出中采样得到的。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:234
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:234
|
||||||
msgid "`token indices`: `[3, 14, 29, 30, 31]`"
|
msgid "`token indices`: `[3, 14, 29, 30, 31]`"
|
||||||
@@ -553,7 +587,9 @@ msgid ""
|
|||||||
"We allocate the blocks `7` and `8` to `request_1` and `request_2` "
|
"We allocate the blocks `7` and `8` to `request_1` and `request_2` "
|
||||||
"respectively, as they need more space in device to store KV cache "
|
"respectively, as they need more space in device to store KV cache "
|
||||||
"following token generation or chunked prefill."
|
"following token generation or chunked prefill."
|
||||||
msgstr "我们将块 `7` 和 `8` 分别分配给 `request_1` 和 `request_2`,因为它们在令牌生成或分块预填充后需要更多设备空间来存储 KV 缓存。"
|
msgstr ""
|
||||||
|
"我们将块 `7` 和 `8` 分别分配给 `request_1` 和 "
|
||||||
|
"`request_2`,因为它们在令牌生成或分块预填充后需要更多设备空间来存储 KV 缓存。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:241
|
#: ../../source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md:241
|
||||||
msgid "Current **Block Table**:"
|
msgid "Current **Block Table**:"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -35,9 +35,8 @@ msgid ""
|
|||||||
"Ascend NPUs and is automatically executed during worker initialization "
|
"Ascend NPUs and is automatically executed during worker initialization "
|
||||||
"when enabled."
|
"when enabled."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"CPU 绑定将 vLLM Ascend 工作进程和关键线程固定到特定的 CPU 核心,以减少 CPU-"
|
"CPU 绑定将 vLLM Ascend 工作进程和关键线程固定到特定的 CPU 核心,以减少 CPU-NPU 跨 NUMA "
|
||||||
"NPU 跨 NUMA 流量,并在多进程工作负载下稳定延迟。它专为运行 Ascend NPU 的 ARM "
|
"流量,并在多进程工作负载下稳定延迟。它专为运行 Ascend NPU 的 ARM 服务器设计,启用后会在工作进程初始化期间自动执行。"
|
||||||
"服务器设计,启用后会在工作进程初始化期间自动执行。"
|
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:7
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:7
|
||||||
msgid "Background"
|
msgid "Background"
|
||||||
@@ -53,10 +52,9 @@ msgid ""
|
|||||||
"purely a host‑side affinity policy and does not change model execution "
|
"purely a host‑side affinity policy and does not change model execution "
|
||||||
"logic."
|
"logic."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"在多插槽 ARM 系统上,操作系统调度器可能会将 vLLM 线程放置在远离本地 NPU 的 "
|
"在多插槽 ARM 系统上,操作系统调度器可能会将 vLLM 线程放置在远离本地 NPU 的 CPU 上,从而导致 NUMA "
|
||||||
"CPU 上,从而导致 NUMA 跨域流量和延迟抖动。CPU 绑定强制执行一种确定性的 CPU "
|
"跨域流量和延迟抖动。CPU 绑定强制执行一种确定性的 CPU 放置策略,并可选地将 NPU IRQ 绑定到同一个 CPU "
|
||||||
"放置策略,并可选地将 NPU IRQ 绑定到同一个 CPU 池。这与其他性能特性(如图模式"
|
"池。这与其他性能特性(如图模式或动态批处理)不同,因为它纯粹是主机端的亲和性策略,不改变模型执行逻辑。"
|
||||||
"或动态批处理)不同,因为它纯粹是主机端的亲和性策略,不改变模型执行逻辑。"
|
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:11
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:11
|
||||||
msgid "Design & How it works"
|
msgid "Design & How it works"
|
||||||
@@ -71,8 +69,8 @@ msgid ""
|
|||||||
"**Allowed CPU list**: The cpuset from /proc/self/status "
|
"**Allowed CPU list**: The cpuset from /proc/self/status "
|
||||||
"(Cpus_allowed_list). All allocations are constrained to this list."
|
"(Cpus_allowed_list). All allocations are constrained to this list."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**允许的 CPU 列表**:来自 /proc/self/status (Cpus_allowed_list) 的 cpuset。"
|
"**允许的 CPU 列表**:来自 /proc/self/status (Cpus_allowed_list) 的 "
|
||||||
"所有分配都受限于此列表。"
|
"cpuset。所有分配都受限于此列表。"
|
||||||
|
|
||||||
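For reference, the allowed CPU list can be read with a few lines of Python. This is a sketch under the assumption that the value uses the usual Linux range syntax (for example `0-3,8,10-11`); the function names are hypothetical and not taken from the vLLM Ascend code.

```python
def parse_cpu_list(text):
    """Expand a range string such as '0-3,8,10-11' into a list of CPU IDs."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_allowed_cpus(status_path="/proc/self/status"):
    """Return the CPUs this process may run on (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []


print(parse_cpu_list("0-3,8,10-11"))  # [0, 1, 2, 3, 8, 10, 11]
```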
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:16
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:16
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -86,8 +84,7 @@ msgstr ""
|
|||||||
msgid ""
|
msgid ""
|
||||||
"**CPU pool per NPU**: The CPU list assigned to each logical NPU ID based "
|
"**CPU pool per NPU**: The CPU list assigned to each logical NPU ID based "
|
||||||
"on the binding mode."
|
"on the binding mode."
|
||||||
msgstr ""
|
msgstr "**每个 NPU 的 CPU 池**:根据绑定模式分配给每个逻辑 NPU ID 的 CPU 列表。"
|
||||||
"**每个 NPU 的 CPU 池**:根据绑定模式分配给每个逻辑 NPU ID 的 CPU 列表。"
|
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:18
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:18
|
||||||
msgid "**Binding modes & Device behavior**:"
|
msgid "**Binding modes & Device behavior**:"
|
||||||
@@ -119,8 +116,8 @@ msgid ""
|
|||||||
"logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU"
|
"logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU"
|
||||||
" cores. This prevents CPU core overlap across multiple process groups."
|
" cores. This prevents CPU core overlap across multiple process groups."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"根据**全局逻辑 NPU 总数**均匀分割允许的 CPU 列表,确保每个 NPU 被分配一个连"
|
"根据**全局逻辑 NPU 总数**均匀分割允许的 CPU 列表,确保每个 NPU 被分配一个连续的 CPU 核心段。这可以防止多个进程组之间的 "
|
||||||
"续的 CPU 核心段。这可以防止多个进程组之间的 CPU 核心重叠。"
|
"CPU 核心重叠。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md
|
||||||
msgid "A2 / 310P / Others"
|
msgid "A2 / 310P / Others"
|
||||||
@@ -136,8 +133,8 @@ msgid ""
|
|||||||
"If multiple NPUs are assigned to a single NUMA node (which may cause "
|
"If multiple NPUs are assigned to a single NUMA node (which may cause "
|
||||||
"bandwidth contention), the CPU allocation extends to adjacent NUMA nodes."
|
"bandwidth contention), the CPU allocation extends to adjacent NUMA nodes."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"基于 NPU 拓扑亲和性 (`npu-smi info -t topo`) 分配 CPU。如果多个 NPU 被分配"
|
"基于 NPU 拓扑亲和性 (`npu-smi info -t topo`) 分配 CPU。如果多个 NPU 被分配到单个 NUMA "
|
||||||
"到单个 NUMA 节点(可能导致带宽争用),则 CPU 分配会扩展到相邻的 NUMA 节点。"
|
"节点(可能导致带宽争用),则 CPU 分配会扩展到相邻的 NUMA 节点。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:25
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:25
|
||||||
msgid "**Default**: enabled (enable_cpu_binding = true)."
|
msgid "**Default**: enabled (enable_cpu_binding = true)."
|
||||||
@@ -151,8 +148,7 @@ msgstr "**回退**:如果 NPU 拓扑亲和性不可用,则使用 global_slic
|
|||||||
msgid ""
|
msgid ""
|
||||||
"**Failure handling**: Any exception in binding is logged as a warning and"
|
"**Failure handling**: Any exception in binding is logged as a warning and"
|
||||||
" **binding is skipped for that rank**."
|
" **binding is skipped for that rank**."
|
||||||
msgstr ""
|
msgstr "**故障处理**:绑定过程中的任何异常都会记录为警告,并且**跳过该等级的绑定**。"
|
||||||
"**故障处理**:绑定过程中的任何异常都会记录为警告,并且**跳过该等级的绑定**。"
|
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:29
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:29
|
||||||
msgid "Execution flow (simplified)"
|
msgid "Execution flow (simplified)"
|
||||||
@@ -373,9 +369,7 @@ msgstr "`IRQ`: 600-601, `Main`: 602-637, `ACL`: 638, `Release`: 639"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"This layout remains deterministic even when multiple processes share the "
|
"This layout remains deterministic even when multiple processes share the "
|
||||||
"same cpuset, because slicing is based on the global logical NPU ID."
|
"same cpuset, because slicing is based on the global logical NPU ID."
|
||||||
msgstr ""
|
msgstr "即使多个进程共享同一个 cpuset,此布局也保持确定性,因为切片是基于全局逻辑 NPU ID 的。"
|
||||||
"即使多个进程共享同一个 cpuset,此布局也保持确定性,因为切片是基于全局逻辑 "
|
|
||||||
"NPU ID 的。"
|
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:86
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:86
|
||||||
msgid "Example 2: A3 global_slice, even split"
|
msgid "Example 2: A3 global_slice, even split"
|
||||||
@@ -389,6 +383,10 @@ msgstr "示例 2:A3 global_slice,均匀分割"
|
|||||||
msgid "**Inputs**:"
|
msgid "**Inputs**:"
|
||||||
msgstr "**输入**:"
|
msgstr "**输入**:"
|
||||||
|
|
||||||
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:90
|
||||||
|
msgid "allowed_cpus = [0..23] (24 CPUs)"
|
||||||
|
msgstr "allowed_cpus = [0..23] (24个CPU)"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:91
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:91
|
||||||
msgid ""
|
msgid ""
|
||||||
"NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..11, NUMA1 ="
|
"NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..11, NUMA1 ="
|
||||||
@@ -520,7 +518,10 @@ msgid ""
|
|||||||
"(6,7) and NUMA1 (8..11). This is a direct consequence of global slicing "
|
"(6,7) and NUMA1 (8..11). This is a direct consequence of global slicing "
|
||||||
"over the ordered cpuset; the remainder distribution does not enforce NUMA"
|
"over the ordered cpuset; the remainder distribution does not enforce NUMA"
|
||||||
" boundaries."
|
" boundaries."
|
||||||
msgstr "在上述对称NUMA布局中 (NUMA0 = 0..7, NUMA1 = 8..16),NPU0保持在NUMA0内,NPU2保持在NUMA1内,但NPU1跨越了NUMA0 (6,7) 和 NUMA1 (8..11)。这是对有序cpuset进行全局切片的直接结果;余数分配不强制NUMA边界。"
|
msgstr ""
|
||||||
|
"在上述对称NUMA布局中 (NUMA0 = 0..7, NUMA1 = "
|
||||||
|
"8..16),NPU0保持在NUMA0内,NPU2保持在NUMA1内,但NPU1跨越了NUMA0 (6,7) 和 NUMA1 "
|
||||||
|
"(8..11)。这是对有序cpuset进行全局切片的直接结果;余数分配不强制NUMA边界。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:134
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:134
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -539,7 +540,9 @@ msgid ""
|
|||||||
"avoid cross‑NUMA pools. A future enhancement should incorporate NUMA node"
|
"avoid cross‑NUMA pools. A future enhancement should incorporate NUMA node"
|
||||||
" boundaries into the slicing logic so that pools remain within a single "
|
" boundaries into the slicing logic so that pools remain within a single "
|
||||||
"NUMA node whenever possible."
|
"NUMA node whenever possible."
|
||||||
msgstr "使用当前的 `global_slice` 策略,某些CPU/NPU布局无法避免跨NUMA池。未来的增强应将NUMA节点边界纳入切片逻辑,以便池尽可能保持在单个NUMA节点内。"
|
msgstr ""
|
||||||
|
"使用当前的 `global_slice` "
|
||||||
|
"策略,某些CPU/NPU布局无法避免跨NUMA池。未来的增强应将NUMA节点边界纳入切片逻辑,以便池尽可能保持在单个NUMA节点内。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:140
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:140
|
||||||
msgid "Example 4: global_slice with visible subset of NPUs"
|
msgid "Example 4: global_slice with visible subset of NPUs"
|
||||||
@@ -594,7 +597,6 @@ msgid "Example 5: A2/310P topo_affinity with NUMA extension"
|
|||||||
msgstr "示例 5: 具有NUMA扩展的 A2/310P topo_affinity"
|
msgstr "示例 5: 具有NUMA扩展的 A2/310P topo_affinity"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:163
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:163
|
||||||
#, python-brace-format
|
|
||||||
msgid "npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)"
|
msgid "npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)"
|
||||||
msgstr "npu_affinity = {0: [0..7], 1: [0..7]} (来自 `npu-smi info -t topo`)"
|
msgstr "npu_affinity = {0: [0..7], 1: [0..7]} (来自 `npu-smi info -t topo`)"
|
||||||
|
|
||||||
@@ -745,11 +747,12 @@ msgid ""
|
|||||||
"0–31, NUMA1 = CPUs 32–63, and the cpuset is 0–63. With 4 logical NPUs, "
|
"0–31, NUMA1 = CPUs 32–63, and the cpuset is 0–63. With 4 logical NPUs, "
|
||||||
"global slicing yields 16 CPUs per NPU (0–15, 16–31, 32–47, 48–63), so "
|
"global slicing yields 16 CPUs per NPU (0–15, 16–31, 32–47, 48–63), so "
|
||||||
"each NPU’s pool stays within a single NUMA node."
|
"each NPU’s pool stays within a single NUMA node."
|
||||||
msgstr "示例(对称布局):2个NUMA节点,总共64个CPU。NUMA0 = CPU 0–31,NUMA1 = CPU 32–63,cpuset为0–63。对于4个逻辑NPU,全局切片每个NPU产生16个CPU (0–15, 16–31, 32–47, 48–63),因此每个NPU的池保持在单个NUMA节点内。"
|
msgstr ""
|
||||||
|
"示例(对称布局):2个NUMA节点,共64个CPU。NUMA0 = CPU 0–31,NUMA1 = CPU 32–63,cpuset为0–63。对于4个逻辑NPU,全局切片为每个NPU分配16个CPU (0–15, 16–31, 32–47, 48–63),因此每个NPU的CPU池都保持在单个NUMA节点内。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:212
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:212
|
||||||
msgid "**Runtime dependencies**:"
|
msgid "**Runtime dependencies**:"
|
||||||
msgstr "**运行时依赖**:"
|
msgstr "**运行时依赖项**:"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:213
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:213
|
||||||
msgid "Requires npu‑smi and lscpu commands."
|
msgid "Requires npu‑smi and lscpu commands."
|
||||||
@@ -761,13 +764,13 @@ msgstr "IRQ绑定需要对 /proc/irq 的写访问权限。"
|
|||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:215
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:215
|
||||||
msgid "Memory binding requires migratepages; otherwise it is skipped."
|
msgid "Memory binding requires migratepages; otherwise it is skipped."
|
||||||
msgstr "内存绑定需要 migratepages;否则将被跳过。"
|
msgstr "内存绑定需要 migratepages;否则将跳过此步骤。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:216
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:216
|
||||||
msgid ""
|
msgid ""
|
||||||
"**IRQ side effects**: irqbalance may be stopped to avoid overriding "
|
"**IRQ side effects**: irqbalance may be stopped to avoid overriding "
|
||||||
"bindings."
|
"bindings."
|
||||||
msgstr "**IRQ副作用**:可能会停止 irqbalance 以避免覆盖绑定。"
|
msgstr "**IRQ副作用**:可能会停止 irqbalance 服务以避免覆盖绑定。"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:217
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:217
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -788,13 +791,15 @@ msgstr "使用标准的 vLLM 日志配置来启用调试日志。当启用调试
|
|||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:223
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:223
|
||||||
msgid "References"
|
msgid "References"
|
||||||
msgstr "参考"
|
msgstr "参考资料"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:225
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:225
|
||||||
msgid ""
|
msgid ""
|
||||||
"CPU binding implementation: vllm_ascend/cpu_binding.py (`DeviceInfo`, "
|
"CPU binding implementation: vllm_ascend/cpu_binding.py (`DeviceInfo`, "
|
||||||
"`CpuAlloc`, `bind_cpus`)"
|
"`CpuAlloc`, `bind_cpus`)"
|
||||||
msgstr "CPU 绑定实现:vllm_ascend/cpu_binding.py (`DeviceInfo`, `CpuAlloc`, `bind_cpus`)"
|
msgstr ""
|
||||||
|
"CPU 绑定实现:vllm_ascend/cpu_binding.py (`DeviceInfo`, `CpuAlloc`, "
|
||||||
|
"`bind_cpus`)"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:226
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:226
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -807,7 +812,9 @@ msgid ""
|
|||||||
"Additional config option: "
|
"Additional config option: "
|
||||||
"docs/source/user_guide/configuration/additional_config.md "
|
"docs/source/user_guide/configuration/additional_config.md "
|
||||||
"(`enable_cpu_binding`)"
|
"(`enable_cpu_binding`)"
|
||||||
msgstr "附加配置选项:docs/source/user_guide/configuration/additional_config.md (`enable_cpu_binding`)"
|
msgstr ""
|
||||||
|
"附加配置选项:docs/source/user_guide/configuration/additional_config.md "
|
||||||
|
"(`enable_cpu_binding`)"
|
||||||
|
|
||||||
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:228
|
#: ../../source/developer_guide/Design_Documents/cpu_binding.md:228
|
||||||
msgid "Tests: tests/ut/device_allocator/test_cpu_binding.py"
|
msgid "Tests: tests/ut/device_allocator/test_cpu_binding.py"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -53,7 +53,12 @@ msgid ""
|
|||||||
"memory usage, it would introduce additional communication and small "
|
"memory usage, it would introduce additional communication and small "
|
||||||
"operator overhead. Therefore, we will not enable the DCP feature on node "
|
"operator overhead. Therefore, we will not enable the DCP feature on node "
|
||||||
"d."
|
"d."
|
||||||
msgstr "以 Deepseek-V3.1-w8a8 模型为例,使用 3 台 Atlas 800T A3 服务器部署“1P1D”架构。节点 p 跨多台机器部署,而节点 d 部署在单台机器上。假设预填充服务器的 IP 为 192.0.0.1(预填充 1)和 192.0.0.2(预填充 2),解码器服务器为 192.0.0.3(解码器 1)。每台服务器使用 8 个 NPU(16 个芯片)部署一个服务实例。在当前示例中,我们将在节点 p 上启用上下文并行特性以改善 TTFT。虽然在节点 d 上启用 DCP 特性可以减少内存使用,但会引入额外的通信和小算子开销。因此,我们不会在节点 d 上启用 DCP 特性。"
|
msgstr ""
|
||||||
|
"以 Deepseek-V3.1-w8a8 模型为例,使用 3 台 Atlas 800T A3 服务器部署“1P1D”架构。节点 p "
|
||||||
|
"跨多台机器部署,而节点 d 部署在单台机器上。假设预填充服务器的 IP 为 192.0.0.1(预填充 1)和 192.0.0.2(预填充 "
|
||||||
|
"2),解码器服务器为 192.0.0.3(解码器 1)。每台服务器使用 8 个 NPU(16 个芯片)部署一个服务实例。在当前示例中,我们将在节点"
|
||||||
|
" p 上启用上下文并行特性以改善 TTFT。虽然在节点 d 上启用 DCP "
|
||||||
|
"特性可以减少内存使用,但会引入额外的通信和小算子开销。因此,我们不会在节点 d 上启用 DCP 特性。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:13
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:13
|
||||||
msgid "Environment Preparation"
|
msgid "Environment Preparation"
|
||||||
@@ -69,7 +74,11 @@ msgid ""
|
|||||||
"model weight](https://www.modelscope.cn/models/Eco-"
|
"model weight](https://www.modelscope.cn/models/Eco-"
|
||||||
"Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to "
|
"Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to "
|
||||||
"`bfloat16` in `config.json`."
|
"`bfloat16` in `config.json`."
|
||||||
msgstr "`DeepSeek-V3.1_w8a8mix_mtp`(混合 MTP 量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8)。请在 `config.json` 中将 `torch_dtype` 从 `float16` 修改为 `bfloat16`。"
|
msgstr ""
|
||||||
|
"`DeepSeek-V3.1_w8a8mix_mtp`(混合 MTP "
|
||||||
|
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-"
|
||||||
|
"Tech/DeepSeek-V3.1-w8a8)。请在 `config.json` 中将 `torch_dtype` 从 `float16` "
|
||||||
|
"修改为 `bfloat16`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:19
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:19
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -86,7 +95,9 @@ msgid ""
|
|||||||
"Refer to [verify multi-node communication "
|
"Refer to [verify multi-node communication "
|
||||||
"environment](../../installation.md#verify-multi-node-communication) to "
|
"environment](../../installation.md#verify-multi-node-communication) to "
|
||||||
"verify multi-node communication."
|
"verify multi-node communication."
|
||||||
msgstr "请参考[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
|
msgstr ""
|
||||||
|
"请参考[验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||||
|
"communication)来验证多节点通信。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:25
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:25
|
||||||
msgid "Installation"
|
msgid "Installation"
|
||||||
@@ -101,7 +112,9 @@ msgid ""
|
|||||||
"Select an image based on your machine type and start the Docker image on "
|
"Select an image based on your machine type and start the Docker image on "
|
||||||
"your node, refer to [using Docker](../../installation.md#set-up-using-"
|
"your node, refer to [using Docker](../../installation.md#set-up-using-"
|
||||||
"docker)."
|
"docker)."
|
||||||
msgstr "根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-up-using-docker)。"
|
msgstr ""
|
||||||
|
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-"
|
||||||
|
"up-using-docker)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:64
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:64
|
||||||
msgid "You need to set up environment on each node."
|
msgid "You need to set up environment on each node."
|
||||||
@@ -119,7 +132,10 @@ msgid ""
|
|||||||
"socket listeners. To avoid any issues, port conflicts should be "
|
"socket listeners. To avoid any issues, port conflicts should be "
|
||||||
"prevented. Additionally, ensure that each node's engine_id is uniquely "
|
"prevented. Additionally, ensure that each node's engine_id is uniquely "
|
||||||
"assigned to avoid conflicts."
|
"assigned to avoid conflicts."
|
||||||
msgstr "我们可以分别在预填充器/解码器节点上运行以下脚本来启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + num_chips 的端口范围来初始化 socket 监听器。为避免任何问题,应防止端口冲突。此外,请确保每个节点的 engine_id 被唯一分配以避免冲突。"
|
msgstr ""
|
||||||
|
"我们可以分别在预填充器/解码器节点上运行以下脚本来启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + "
|
||||||
|
"num_chips 的端口范围来初始化 socket 监听器。为避免任何问题,应防止端口冲突。此外,请确保每个节点的 engine_id "
|
||||||
|
"被唯一分配以避免冲突。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:70
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:70
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -154,7 +170,10 @@ msgid ""
|
|||||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
||||||
"project/vllm-"
|
"project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr "在与预填充服务实例相同的节点上运行代理服务器。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
msgstr ""
|
||||||
|
"在与预填充服务实例相同的节点上运行代理服务器。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
|
||||||
|
"/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:301
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:301
|
||||||
msgid "**Notice:** The parameters are explained as follows:"
|
msgid "**Notice:** The parameters are explained as follows:"
|
||||||
@@ -193,21 +212,29 @@ msgid ""
|
|||||||
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
||||||
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
||||||
"`--data-parallel-size` >= the actual total concurrency."
|
"`--data-parallel-size` >= the actual total concurrency."
|
||||||
msgstr "`--max-num-seqs` 表示每个 DP 组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= 实际总并发数。"
|
msgstr ""
|
||||||
|
"`--max-num-seqs` 表示每个 DP "
|
||||||
|
"组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 "
|
||||||
|
"TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` "
|
||||||
|
">= 实际总并发数。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:309
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:309
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
||||||
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
||||||
"enables ChunkPrefill/SplitFuse by default, which means:"
|
"enables ChunkPrefill/SplitFuse by default, which means:"
|
||||||
msgstr "`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 ChunkPrefill/SplitFuse,这意味着:"
|
msgstr ""
|
||||||
|
"`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
|
||||||
|
"ChunkPrefill/SplitFuse,这意味着:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:310
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:310
|
||||||
msgid ""
|
msgid ""
|
||||||
"(1) If the input length of a request is greater than `--max-num-batched-"
|
"(1) If the input length of a request is greater than `--max-num-batched-"
|
||||||
"tokens`, it will be divided into multiple rounds of computation according"
|
"tokens`, it will be divided into multiple rounds of computation according"
|
||||||
" to `--max-num-batched-tokens`;"
|
" to `--max-num-batched-tokens`;"
|
||||||
msgstr "(1)如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens` 被分成多轮计算;"
|
msgstr ""
|
||||||
|
"(1)如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens`"
|
||||||
|
" 被分成多轮计算;"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:311
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:311
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -236,14 +263,22 @@ msgid ""
|
|||||||
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
||||||
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
||||||
"during actual inference. The default value is `0.9`."
|
"during actual inference. The default value is `0.9`."
|
||||||
msgstr "`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache 大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache 就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
msgstr ""
|
||||||
|
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
|
||||||
|
"大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` "
|
||||||
|
"的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
|
||||||
|
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache "
|
||||||
|
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
|
||||||
|
"utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:314
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:314
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
||||||
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
||||||
"use pure EP or pure TP."
|
"use pure EP or pure TP."
|
||||||
msgstr "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE 只能使用纯 EP 或纯 TP。"
|
msgstr ""
|
||||||
|
"`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
|
||||||
|
"只能使用纯 EP 或纯 TP。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:315
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:315
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -266,7 +301,11 @@ msgid ""
|
|||||||
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
||||||
"mainly used to reduce the cost of operator dispatch. Currently, "
|
"mainly used to reduce the cost of operator dispatch. Currently, "
|
||||||
"\"FULL_DECODE_ONLY\" is recommended."
|
"\"FULL_DECODE_ONLY\" is recommended."
|
||||||
msgstr "`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和 \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 \"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 \"FULL_DECODE_ONLY\"。"
|
msgstr ""
|
||||||
|
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
|
||||||
|
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 "
|
||||||
|
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
|
||||||
|
"\"FULL_DECODE_ONLY\"。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:319
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:319
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -276,14 +315,19 @@ msgid ""
|
|||||||
" inputs between levels are automatically padded to the next level. "
|
" inputs between levels are automatically padded to the next level. "
|
||||||
"Currently, the default setting is recommended. Only in some scenarios is "
|
"Currently, the default setting is recommended. Only in some scenarios is "
|
||||||
"it necessary to set this separately to achieve optimal performance."
|
"it necessary to set this separately to achieve optimal performance."
|
||||||
msgstr "\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
|
msgstr ""
|
||||||
|
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
|
||||||
|
"40,..., `--max-num-"
|
||||||
|
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:320
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:320
|
||||||
msgid ""
|
msgid ""
|
||||||
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
|
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
|
||||||
"optimization is enabled. Currently, this optimization is only supported "
|
"optimization is enabled. Currently, this optimization is only supported "
|
||||||
"for MoE in scenarios where tensor-parallel-size > 1."
|
"for MoE in scenarios where tensor-parallel-size > 1."
|
||||||
msgstr "`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 tensor-parallel-size > 1 的场景下对 MoE 提供支持。"
|
msgstr ""
|
||||||
|
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 "
|
||||||
|
"tensor-parallel-size > 1 的场景下对 MoE 提供支持。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:321
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:321
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -291,7 +335,9 @@ msgid ""
|
|||||||
"parallel is enabled. This environment variable is required in the PD "
|
"parallel is enabled. This environment variable is required in the PD "
|
||||||
"architecture but not needed in the PD co-locate deployment scenario. It "
|
"architecture but not needed in the PD co-locate deployment scenario. It "
|
||||||
"will be removed in the future."
|
"will be removed in the future."
|
||||||
msgstr "`export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` 表示启用了上下文并行。此环境变量在 PD 架构中是必需的,但在 PD 共置部署场景中不需要。未来将被移除。"
|
msgstr ""
|
||||||
|
"`export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` 表示启用了上下文并行。此环境变量在 PD "
|
||||||
|
"架构中是必需的,但在 PD 共置部署场景中不需要。未来将被移除。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:323
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:323
|
||||||
msgid "**Notice:**"
|
msgid "**Notice:**"
|
||||||
@@ -314,22 +360,18 @@ msgid "Accuracy Evaluation"
|
|||||||
msgstr "精度评估"
|
msgstr "精度评估"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:330
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:330
|
||||||
msgid "Here are two accuracy evaluation methods."
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:342
|
||||||
msgstr "以下是两种精度评估方法。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:332
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:344
|
|
||||||
msgid "Using AISBench"
|
msgid "Using AISBench"
|
||||||
msgstr "使用 AISBench"
|
msgstr "使用 AISBench"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:334
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:332
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using "
|
"Refer to [Using "
|
||||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||||
"details."
|
"details."
|
||||||
msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:336
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:334
|
||||||
msgid ""
|
msgid ""
|
||||||
"After execution, you can get the result, here is the result of "
|
"After execution, you can get the result, here is the result of "
|
||||||
"`DeepSeek-V3.1-w8a8` for reference only."
|
"`DeepSeek-V3.1-w8a8` for reference only."
|
||||||
@@ -375,52 +417,55 @@ msgstr "生成"
|
|||||||
msgid "86.67"
|
msgid "86.67"
|
||||||
msgstr "86.67"
|
msgstr "86.67"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:342
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:340
|
||||||
msgid "Performance"
|
msgid "Performance"
|
||||||
msgstr "性能"
|
msgstr "性能"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:346
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:344
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using AISBench for performance "
|
"Refer to [Using AISBench for performance "
|
||||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
"performance-evaluation) for details."
|
"performance-evaluation) for details."
|
||||||
msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
msgstr ""
|
||||||
|
"详情请参阅[使用 AISBench "
|
||||||
|
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
|
"performance-evaluation)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:348
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:346
|
||||||
msgid "Using vLLM Benchmark"
|
msgid "Using vLLM Benchmark"
|
||||||
msgstr "使用 vLLM 基准测试"
|
msgstr "使用 vLLM 基准测试"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:350
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:348
|
||||||
msgid "Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example."
|
msgid "Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example."
|
||||||
msgstr "以运行 `DeepSeek-V3.1-w8a8` 的性能评估为例。"
|
msgstr "以运行 `DeepSeek-V3.1-w8a8` 的性能评估为例。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:352
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:350
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
||||||
"for more details."
|
"for more details."
|
||||||
msgstr "更多详情请参阅 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
msgstr "更多详情请参阅 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:354
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:352
|
||||||
msgid "There are three `vllm bench` subcommands:"
|
msgid "There are three `vllm bench` subcommands:"
|
||||||
msgstr "`vllm bench` 包含三个子命令:"
|
msgstr "`vllm bench` 包含三个子命令:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:356
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:354
|
||||||
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
||||||
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:357
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:355
|
||||||
msgid "`serve`: Benchmark the online serving throughput."
|
msgid "`serve`: Benchmark the online serving throughput."
|
||||||
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:358
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:356
|
||||||
msgid "`throughput`: Benchmark offline inference throughput."
|
msgid "`throughput`: Benchmark offline inference throughput."
|
||||||
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:360
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:358
|
||||||
msgid "Take the `serve` as an example. Run the code as follows."
|
msgid "Take the `serve` as an example. Run the code as follows."
|
||||||
msgstr "以 `serve` 为例,按如下方式运行代码。"
|
msgstr "以 `serve` 为例,按如下方式运行代码。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:367
|
#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:365
|
||||||
msgid ""
|
msgid ""
|
||||||
"After about several minutes, you can get the performance evaluation "
|
"After about several minutes, you can get the performance evaluation "
|
||||||
"result."
|
"result."
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -38,7 +38,9 @@ msgid ""
|
|||||||
"Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example,"
|
"Using the `Qwen3-235B-A22B-w8a8` (Quantized version) model as an example,"
|
||||||
" use 1 Atlas 800 A3 (64G × 16) server to deploy the single node \"pd co-"
|
" use 1 Atlas 800 A3 (64G × 16) server to deploy the single node \"pd co-"
|
||||||
"locate\" architecture."
|
"locate\" architecture."
|
||||||
msgstr "以 `Qwen3-235B-A22B-w8a8`(量化版本)模型为例,使用 1 台 Atlas 800 A3(64G × 16)服务器部署单节点 \"pd co-locate\" 架构。"
|
msgstr ""
|
||||||
|
"以 `Qwen3-235B-A22B-w8a8`(量化版本)模型为例,使用 1 台 Atlas 800 A3(64G × 16)服务器部署单节点 "
|
||||||
|
"\"pd co-locate\" 架构。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:9
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:9
|
||||||
msgid "Environment Preparation"
|
msgid "Environment Preparation"
|
||||||
@@ -53,7 +55,10 @@ msgid ""
|
|||||||
"`Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G "
|
"`Qwen3-235B-A22B-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G "
|
||||||
"× 16) node. [Download model weight](https://modelscope.cn/models/vllm-"
|
"× 16) node. [Download model weight](https://modelscope.cn/models/vllm-"
|
||||||
"ascend/Qwen3-235B-A22B-W8A8)"
|
"ascend/Qwen3-235B-A22B-W8A8)"
|
||||||
msgstr "`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3(64G × 16)节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)"
|
msgstr ""
|
||||||
|
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3(64G × "
|
||||||
|
"16)节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-"
|
||||||
|
"W8A8)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:15
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:15
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -69,6 +74,42 @@ msgstr "使用 Docker 运行"
|
|||||||
msgid "Start a Docker container on each node."
|
msgid "Start a Docker container on each node."
|
||||||
msgstr "在每个节点上启动一个 Docker 容器。"
|
msgstr "在每个节点上启动一个 Docker 容器。"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "dataset"
|
||||||
|
msgstr "数据集"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "version"
|
||||||
|
msgstr "版本"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "metric"
|
||||||
|
msgstr "指标"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "mode"
|
||||||
|
msgstr "模式"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "vllm-api-general-chat"
|
||||||
|
msgstr "vllm-api-general-chat"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "aime2024"
|
||||||
|
msgstr "aime2024"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "-"
|
||||||
|
msgstr "-"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "accuracy"
|
||||||
|
msgstr "准确率"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
|
msgid "gen"
|
||||||
|
msgstr "生成"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:63
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:63
|
||||||
msgid "Deployment"
|
msgid "Deployment"
|
||||||
msgstr "部署"
|
msgstr "部署"
|
||||||
@@ -81,7 +122,9 @@ msgstr "单节点部署"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16). "
|
"`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16). "
|
||||||
"Quantized version needs to start with parameter `--quantization ascend`."
|
"Quantized version needs to start with parameter `--quantization ascend`."
|
||||||
msgstr "`Qwen3-235B-A22B-w8a8` 可以部署在 1 台 Atlas 800 A3(64G*16)上。量化版本需要使用参数 `--quantization ascend` 启动。"
|
msgstr ""
|
||||||
|
"`Qwen3-235B-A22B-w8a8` 可以部署在 1 台 Atlas 800 A3(64G*16)上。量化版本需要使用参数 "
|
||||||
|
"`--quantization ascend` 启动。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:70
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:70
|
||||||
msgid "Run the following script to execute online 128k inference."
|
msgid "Run the following script to execute online 128k inference."
|
||||||
@@ -98,7 +141,10 @@ msgid ""
|
|||||||
"for vllm version below `v0.12.0` use parameter: `--rope_scaling "
|
"for vllm version below `v0.12.0` use parameter: `--rope_scaling "
|
||||||
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
|
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
|
||||||
" \\`"
|
" \\`"
|
||||||
msgstr "对于 vllm 版本低于 `v0.12.0`,使用参数:`--rope_scaling '{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}' \\`"
|
msgstr ""
|
||||||
|
"对于 vllm 版本低于 `v0.12.0`,使用参数:`--rope_scaling "
|
||||||
|
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
|
||||||
|
" \\`"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:109
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:109
|
||||||
#, python-brace-format
|
#, python-brace-format
|
||||||
@@ -107,7 +153,10 @@ msgid ""
|
|||||||
"'{\"rope_parameters\": "
|
"'{\"rope_parameters\": "
|
||||||
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
|
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
|
||||||
" \\`"
|
" \\`"
|
||||||
msgstr "对于 vllm 版本 `v0.12.0`,使用参数:`--hf-overrides '{\"rope_parameters\": {\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}' \\`"
|
msgstr ""
|
||||||
|
"对于 vllm 版本 `v0.12.0`,使用参数:`--hf-overrides '{\"rope_parameters\": "
|
||||||
|
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
|
||||||
|
" \\`"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:111
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:111
|
||||||
msgid "The parameters are explained as follows:"
|
msgid "The parameters are explained as follows:"
|
||||||
@@ -146,21 +195,29 @@ msgid ""
|
|||||||
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
||||||
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
||||||
"`--data-parallel-size` >= the actual total concurrency."
|
"`--data-parallel-size` >= the actual total concurrency."
|
||||||
msgstr "`--max-num-seqs` 表示每个 DP 组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= 实际总并发数。"
|
msgstr ""
|
||||||
|
"`--max-num-seqs` 表示每个 DP "
|
||||||
|
"组允许处理的最大请求数。如果发送到服务的请求数量超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 "
|
||||||
|
"TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` "
|
||||||
|
">= 实际总并发数。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:118
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:118
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
||||||
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
||||||
"enables ChunkPrefill/SplitFuse by default, which means:"
|
"enables ChunkPrefill/SplitFuse by default, which means:"
|
||||||
msgstr "`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 ChunkPrefill/SplitFuse,这意味着:"
|
msgstr ""
|
||||||
|
"`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
|
||||||
|
"ChunkPrefill/SplitFuse,这意味着:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:119
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:119
|
||||||
msgid ""
|
msgid ""
|
||||||
"(1) If the input length of a request is greater than `--max-num-batched-"
|
"(1) If the input length of a request is greater than `--max-num-batched-"
|
||||||
"tokens`, it will be divided into multiple rounds of computation according"
|
"tokens`, it will be divided into multiple rounds of computation according"
|
||||||
" to `--max-num-batched-tokens`;"
|
" to `--max-num-batched-tokens`;"
|
||||||
msgstr "(1)如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens` 被分成多轮计算;"
|
msgstr ""
|
||||||
|
"(1)如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens`"
|
||||||
|
" 被分成多轮计算;"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:120
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:120
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -189,14 +246,22 @@ msgid ""
|
|||||||
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
||||||
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
||||||
"during actual inference. The default value is `0.9`."
|
"during actual inference. The default value is `0.9`."
|
||||||
msgstr "`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache 大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache 就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
msgstr ""
|
||||||
|
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
|
||||||
|
"大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` "
|
||||||
|
"的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
|
||||||
|
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache "
|
||||||
|
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
|
||||||
|
"utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:123
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:123
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
||||||
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
||||||
"use pure EP or pure TP."
|
"use pure EP or pure TP."
|
||||||
msgstr "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE 要么使用纯 EP,要么使用纯 TP。"
|
msgstr ""
|
||||||
|
"`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
|
||||||
|
"要么使用纯 EP,要么使用纯 TP。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:124
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:124
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -219,7 +284,11 @@ msgid ""
|
|||||||
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
||||||
"mainly used to reduce the cost of operator dispatch. Currently, "
|
"mainly used to reduce the cost of operator dispatch. Currently, "
|
||||||
"\"FULL_DECODE_ONLY\" is recommended."
|
"\"FULL_DECODE_ONLY\" is recommended."
|
||||||
msgstr "`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和 \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示具体的图模式。目前支持 \"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 \"FULL_DECODE_ONLY\"。"
|
msgstr ""
|
||||||
|
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
|
||||||
|
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示具体的图模式。目前支持 "
|
||||||
|
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
|
||||||
|
"\"FULL_DECODE_ONLY\"。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:128
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:128
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -229,14 +298,19 @@ msgid ""
|
|||||||
" inputs between levels are automatically padded to the next level. "
|
" inputs between levels are automatically padded to the next level. "
|
||||||
"Currently, the default setting is recommended. Only in some scenarios is "
|
"Currently, the default setting is recommended. Only in some scenarios is "
|
||||||
"it necessary to set this separately to achieve optimal performance."
|
"it necessary to set this separately to achieve optimal performance."
|
||||||
msgstr "\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
|
msgstr ""
|
||||||
|
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
|
||||||
|
"40,..., `--max-num-"
|
||||||
|
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。仅在部分场景中,需要单独设置此参数以达到最佳性能。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:129
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:129
|
||||||
msgid ""
|
msgid ""
|
||||||
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
|
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
|
||||||
"optimization is enabled. Currently, this optimization is only supported "
|
"optimization is enabled. Currently, this optimization is only supported "
|
||||||
"for MoE in scenarios where tp_size > 1."
|
"for MoE in scenarios where tp_size > 1."
|
||||||
msgstr "`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 tp_size > 1 的场景下对 MoE 支持。"
|
msgstr ""
|
||||||
|
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 "
|
||||||
|
"tp_size > 1 的场景下对 MoE 支持。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:133
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:133
|
||||||
msgid "tp_size needs to be divisible by dcp_size"
|
msgid "tp_size needs to be divisible by dcp_size"
|
||||||
@@ -246,120 +320,85 @@ msgstr "tp_size 需要能被 dcp_size 整除"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"decode context parallel size must be less than or equal to max_dcp_size, "
|
"decode context parallel size must be less than or equal to max_dcp_size, "
|
||||||
"where max_dcp_size = tensor_parallel_size // total_num_kv_heads."
|
"where max_dcp_size = tensor_parallel_size // total_num_kv_heads."
|
||||||
msgstr "解码上下文并行大小必须小于或等于 max_dcp_size,其中 max_dcp_size = tensor_parallel_size // total_num_kv_heads。"
|
msgstr ""
|
||||||
|
"解码上下文并行大小必须小于或等于 max_dcp_size,其中 max_dcp_size = tensor_parallel_size // "
|
||||||
|
"total_num_kv_heads。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:136
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:136
|
||||||
msgid "Accuracy Evaluation"
|
msgid "Accuracy Evaluation"
|
||||||
msgstr "精度评估"
|
msgstr "精度评估"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:138
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:138
|
||||||
msgid "Here are two accuracy evaluation methods."
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:150
|
||||||
msgstr "以下是两种精度评估方法。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:140
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:152
|
|
||||||
msgid "Using AISBench"
|
msgid "Using AISBench"
|
||||||
msgstr "使用 AISBench"
|
msgstr "使用 AISBench"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:142
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:140
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using "
|
"Refer to [Using "
|
||||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||||
"details."
|
"details."
|
||||||
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:144
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:142
|
||||||
msgid ""
|
msgid ""
|
||||||
"After execution, you can get the result, here is the result of `Qwen3"
|
"After execution, you can get the result, here is the result of `Qwen3"
|
||||||
"-235B-A22B-w8a8` for reference only."
|
"-235B-A22B-w8a8` for reference only."
|
||||||
msgstr "执行后,您可以获得结果,以下是 `Qwen3-235B-A22B-w8a8` 的结果,仅供参考。"
|
msgstr "执行后,您可以获得结果,以下是 `Qwen3-235B-A22B-w8a8` 的结果,仅供参考。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "dataset"
|
|
||||||
msgstr "数据集"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "version"
|
|
||||||
msgstr "版本"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "metric"
|
|
||||||
msgstr "指标"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "mode"
|
|
||||||
msgstr "模式"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "vllm-api-general-chat"
|
|
||||||
msgstr "vllm-api-general-chat"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "aime2024"
|
|
||||||
msgstr "aime2024"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "-"
|
|
||||||
msgstr "-"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "accuracy"
|
|
||||||
msgstr "准确率"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
|
||||||
msgid "gen"
|
|
||||||
msgstr "生成"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:21
|
||||||
msgid "83.33"
|
msgid "83.33"
|
||||||
msgstr "83.33"
|
msgstr "83.33"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:150
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:148
|
||||||
msgid "Performance"
|
msgid "Performance"
|
||||||
msgstr "性能"
|
msgstr "性能"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:154
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:152
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using AISBench for performance "
|
"Refer to [Using AISBench for performance "
|
||||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
"performance-evaluation) for details."
|
"performance-evaluation) for details."
|
||||||
msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
msgstr ""
|
||||||
|
"详情请参阅[使用 AISBench "
|
||||||
|
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
|
"performance-evaluation)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:156
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:154
|
||||||
msgid "Using vLLM Benchmark"
|
msgid "Using vLLM Benchmark"
|
||||||
msgstr "使用 vLLM Benchmark"
|
msgstr "使用 vLLM Benchmark"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:158
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:156
|
||||||
msgid "Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example."
|
msgid "Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example."
|
||||||
msgstr "以运行 `Qwen3-235B-A22B-w8a8` 的性能评估为例。"
|
msgstr "以运行 `Qwen3-235B-A22B-w8a8` 的性能评估为例。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:160
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:158
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
||||||
"for more details."
|
"for more details."
|
||||||
msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:162
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:160
|
||||||
msgid "There are three `vllm bench` subcommands:"
|
msgid "There are three `vllm bench` subcommands:"
|
||||||
msgstr "`vllm bench` 有三个子命令:"
|
msgstr "`vllm bench` 有三个子命令:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:164
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:162
|
||||||
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
||||||
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:165
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:163
|
||||||
msgid "`serve`: Benchmark the online serving throughput."
|
msgid "`serve`: Benchmark the online serving throughput."
|
||||||
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:166
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:164
|
||||||
msgid "`throughput`: Benchmark offline inference throughput."
|
msgid "`throughput`: Benchmark offline inference throughput."
|
||||||
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:168
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:166
|
||||||
msgid "Take the `serve` as an example. Run the code as follows."
|
msgid "Take the `serve` as an example. Run the code as follows."
|
||||||
msgstr "以 `serve` 为例。运行代码如下。"
|
msgstr "以 `serve` 为例。运行代码如下。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:175
|
#: ../../source/tutorials/features/long_sequence_context_parallel_single_node.md:173
|
||||||
msgid ""
|
msgid ""
|
||||||
"After about several minutes, you can get the performance evaluation "
|
"After about several minutes, you can get the performance evaluation "
|
||||||
"result."
|
"result."
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -41,7 +41,10 @@ msgid ""
|
|||||||
"prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and "
|
"prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and "
|
||||||
"the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). "
|
"the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). "
|
||||||
"On each server, use 8 NPUs 16 chips to deploy one service instance."
|
"On each server, use 8 NPUs 16 chips to deploy one service instance."
|
||||||
msgstr "以 Deepseek-r1-w8a8 模型为例,使用 4 台 Atlas 800T A3 服务器部署 \"2P1D\" 架构。假设预填充服务器 IP 为 192.0.0.1(预填充节点 1)和 192.0.0.2(预填充节点 2),解码服务器 IP 为 192.0.0.3(解码节点 1)和 192.0.0.4(解码节点 2)。每台服务器使用 8 个 NPU(16 个芯片)部署一个服务实例。"
|
msgstr ""
|
||||||
|
"以 Deepseek-r1-w8a8 模型为例,使用 4 台 Atlas 800T A3 服务器部署 \"2P1D\" 架构。假设预填充服务器 "
|
||||||
|
"IP 为 192.0.0.1(预填充节点 1)和 192.0.0.2(预填充节点 2),解码服务器 IP 为 192.0.0.3(解码节点 1)和"
|
||||||
|
" 192.0.0.4(解码节点 2)。每台服务器使用 8 个 NPU(16 个芯片)部署一个服务实例。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:9
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:9
|
||||||
msgid "Verify Multi-Node Communication Environment"
|
msgid "Verify Multi-Node Communication Environment"
|
||||||
@@ -137,7 +140,10 @@ msgid ""
|
|||||||
" by Moonshot AI.Installation and Compilation Guide: <https://github.com"
|
" by Moonshot AI.Installation and Compilation Guide: <https://github.com"
|
||||||
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries> First, we"
|
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries> First, we"
|
||||||
" need to obtain the Mooncake project. Refer to the following command:"
|
" need to obtain the Mooncake project. Refer to the following command:"
|
||||||
msgstr "Mooncake 是月之暗面(Moonshot AI)提供的领先 LLM 服务 Kimi 的推理平台。安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries> 首先,我们需要获取 Mooncake 项目。参考以下命令:"
|
msgstr ""
|
||||||
|
"Mooncake 是月之暗面(Moonshot AI)提供的领先 LLM 服务 Kimi "
|
||||||
|
"的推理平台。安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file"
|
||||||
|
"#build-and-use-binaries> 首先,我们需要获取 Mooncake 项目。参考以下命令:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:183
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:183
|
||||||
msgid "(Optional) Replace go install url if the network is poor"
|
msgid "(Optional) Replace go install url if the network is poor"
|
||||||
@@ -185,7 +191,10 @@ msgid ""
|
|||||||
"socket listeners. To avoid any issues, port conflicts should be "
|
"socket listeners. To avoid any issues, port conflicts should be "
|
||||||
"prevented. Additionally, ensure that each node's engine_id is uniquely "
|
"prevented. Additionally, ensure that each node's engine_id is uniquely "
|
||||||
"assigned to avoid conflicts."
|
"assigned to avoid conflicts."
|
||||||
msgstr "我们可以分别运行以下脚本来在预填充器/解码器节点上启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + num_chips 的端口范围来初始化 socket 监听器。为避免问题,应防止端口冲突。此外,请确保每个节点的 engine_id 被唯一分配,以避免冲突。"
|
msgstr ""
|
||||||
|
"我们可以分别运行以下脚本来在预填充器/解码器节点上启动服务器。请注意,每个 P/D 节点将占用从 kv_port 到 kv_port + "
|
||||||
|
"num_chips 的端口范围来初始化 socket 监听器。为避免问题,应防止端口冲突。此外,请确保每个节点的 engine_id "
|
||||||
|
"被唯一分配,以避免冲突。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:227
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:227
|
||||||
msgid "kv_port Configuration Guide"
|
msgid "kv_port Configuration Guide"
|
||||||
@@ -198,7 +207,10 @@ msgid ""
|
|||||||
"npu_per_node × 1000)`. If `kv_port` overlaps with this range, "
|
"npu_per_node × 1000)`. If `kv_port` overlaps with this range, "
|
||||||
"intermittent port conflicts may occur. To avoid this, configure `kv_port`"
|
"intermittent port conflicts may occur. To avoid this, configure `kv_port`"
|
||||||
" according to the table below:"
|
" according to the table below:"
|
||||||
msgstr "在 Ascend NPU 上,Mooncake 使用 AscendDirectTransport 进行 RDMA 数据传输,它会在 `[20000, 20000 + npu_per_node × 1000)` 范围内随机分配端口。如果 `kv_port` 与此范围重叠,可能会发生间歇性端口冲突。为避免此问题,请根据下表配置 `kv_port`:"
|
msgstr ""
|
||||||
|
"在 Ascend NPU 上,Mooncake 使用 AscendDirectTransport 进行 RDMA 数据传输,它会在 "
|
||||||
|
"`[20000, 20000 + npu_per_node × 1000)` 范围内随机分配端口。如果 `kv_port` "
|
||||||
|
"与此范围重叠,可能会发生间歇性端口冲突。为避免此问题,请根据下表配置 `kv_port`:"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:132
|
||||||
msgid "NPUs per Node"
|
msgid "NPUs per Node"
|
||||||
@@ -242,7 +254,9 @@ msgid ""
|
|||||||
"during startup, it may be caused by kv_port conflicting with randomly "
|
"during startup, it may be caused by kv_port conflicting with randomly "
|
||||||
"allocated AscendDirectTransport ports. Increase your kv_port value to "
|
"allocated AscendDirectTransport ports. Increase your kv_port value to "
|
||||||
"avoid the reserved range."
|
"avoid the reserved range."
|
||||||
msgstr "如果在启动时偶尔看到 `zmq.error.ZMQError: Address already in use`,可能是由于 kv_port 与随机分配的 AscendDirectTransport 端口冲突所致。请增加您的 kv_port 值以避开保留范围。"
|
msgstr ""
|
||||||
|
"如果在启动时偶尔看到 `zmq.error.ZMQError: Address already in use`,可能是由于 kv_port "
|
||||||
|
"与随机分配的 AscendDirectTransport 端口冲突所致。请增加您的 kv_port 值以避开保留范围。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:240
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:240
|
||||||
msgid "launch_online_dp.py"
|
msgid "launch_online_dp.py"
|
||||||
@@ -251,9 +265,12 @@ msgstr "launch_online_dp.py"
|
|||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:242
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:242
|
||||||
msgid ""
|
msgid ""
|
||||||
"Use `launch_online_dp.py` to launch external dp vllm servers. "
|
"Use `launch_online_dp.py` to launch external dp vllm servers. "
|
||||||
"[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
"[launch_online_dp.py](https://github.com/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||||
|
msgstr ""
|
||||||
|
"使用 `launch_online_dp.py` 启动外部解耦 vllm "
|
||||||
|
"服务器。[launch_online_dp.py](https://github.com/vllm-project/vllm-"
|
||||||
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||||
msgstr "使用 `launch_online_dp.py` 启动外部解耦 vllm 服务器。[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:245
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:245
|
||||||
msgid "run_dp_template.sh"
|
msgid "run_dp_template.sh"
|
||||||
@@ -262,9 +279,12 @@ msgstr "run_dp_template.sh"
|
|||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:247
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:247
|
||||||
msgid ""
|
msgid ""
|
||||||
"Modify `run_dp_template.sh` on each node. "
|
"Modify `run_dp_template.sh` on each node. "
|
||||||
"[run\\_dp\\_template.sh](https://github.com/vllm-project/vllm-"
|
"[run_dp_template.sh](https://github.com/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
|
||||||
|
msgstr ""
|
||||||
|
"在每个节点上修改 `run_dp_template.sh`。[run_dp_template.sh](https://github.com"
|
||||||
|
"/vllm-project/vllm-"
|
||||||
"ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
|
"ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
|
||||||
msgstr "在每个节点上修改 `run_dp_template.sh`。[run\\_dp\\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:250
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:250
|
||||||
@@ -321,7 +341,12 @@ msgid ""
|
|||||||
"MooncakeLayerwiseConnector.[load\\_balance\\_proxy\\_layerwise\\_server\\_example.py](https://github.com"
|
"MooncakeLayerwiseConnector.[load\\_balance\\_proxy\\_layerwise\\_server\\_example.py](https://github.com"
|
||||||
"/vllm-project/vllm-"
|
"/vllm-project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
|
||||||
msgstr "**`load_balance_proxy_layerwise_server_example.py`**:请求首先被路由到 D 节点,然后根据需要转发到 P 节点。此代理设计用于与 MooncakeLayerwiseConnector 配合使用。[load\\_balance\\_proxy\\_layerwise\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
|
msgstr ""
|
||||||
|
"**`load_balance_proxy_layerwise_server_example.py`**:请求首先被路由到 D "
|
||||||
|
"节点,然后根据需要转发到 P 节点。此代理设计用于与 MooncakeLayerwiseConnector "
|
||||||
|
"配合使用。[load_balance_proxy_layerwise_server_example.py](https://github.com"
|
||||||
|
"/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:756
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:756
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -331,7 +356,12 @@ msgid ""
|
|||||||
"MooncakeConnector.[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
|
"MooncakeConnector.[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
|
||||||
"/vllm-project/vllm-"
|
"/vllm-project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr "**`load_balance_proxy_server_example.py`**:请求首先被路由到 P 节点,然后转发到 D 节点进行后续处理。此代理设计用于与 MooncakeConnector 配合使用。[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
msgstr ""
|
||||||
|
"**`load_balance_proxy_server_example.py`**:请求首先被路由到 P 节点,然后转发到 D "
|
||||||
|
"节点进行后续处理。此代理设计用于与 MooncakeConnector "
|
||||||
|
"配合使用。[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
|
||||||
|
"/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "Parameter"
|
msgid "Parameter"
|
||||||
@@ -371,7 +401,7 @@ msgstr "--prefiller-ports"
|
|||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "Ports of prefiller nodes"
|
msgid "Ports of prefiller nodes"
|
||||||
msgstr "预填充节点的端口"
|
msgstr "预填充节点端口"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "--decoder-hosts"
|
msgid "--decoder-hosts"
|
||||||
@@ -379,7 +409,7 @@ msgstr "--decoder-hosts"
|
|||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "Hosts of decoder nodes"
|
msgid "Hosts of decoder nodes"
|
||||||
msgstr "解码器节点的主机地址"
|
msgstr "解码器节点主机地址"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "--decoder-ports"
|
msgid "--decoder-ports"
|
||||||
@@ -387,7 +417,7 @@ msgstr "--decoder-ports"
|
|||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:814
|
||||||
msgid "Ports of decoder nodes"
|
msgid "Ports of decoder nodes"
|
||||||
msgstr "解码器节点的端口"
|
msgstr "解码器节点端口"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:877
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:877
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -396,9 +426,8 @@ msgid ""
|
|||||||
"project/vllm-"
|
"project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"您可以在代码仓库的示例中找到代理程序,"
|
"您可以在代码仓库的示例中找到代理程序,[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
|
||||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
"/vllm-project/vllm-"
|
||||||
"project/vllm-"
|
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:879
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:879
|
||||||
@@ -411,8 +440,8 @@ msgid ""
|
|||||||
"[aisbench](https://gitee.com/aisbench/benchmark) Execute the following "
|
"[aisbench](https://gitee.com/aisbench/benchmark) Execute the following "
|
||||||
"commands to install aisbench"
|
"commands to install aisbench"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"我们推荐使用 aisbench 工具进行性能评估。"
|
"我们推荐使用 aisbench 工具进行性能评估。[aisbench](https://gitee.com/aisbench/benchmark)"
|
||||||
"[aisbench](https://gitee.com/aisbench/benchmark) 执行以下命令安装 aisbench"
|
" 执行以下命令安装 aisbench"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:889
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:889
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -443,7 +472,9 @@ msgstr "以 gsm8k 数据集为例,执行以下命令来评估性能。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"For more details for commands and parameters for aisbench, refer to "
|
"For more details for commands and parameters for aisbench, refer to "
|
||||||
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
||||||
msgstr "有关 aisbench 命令和参数的更多详细信息,请参考 [aisbench](https://gitee.com/aisbench/benchmark)"
|
msgstr ""
|
||||||
|
"有关 aisbench 命令和参数的更多详细信息,请参考 "
|
||||||
|
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:932
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:932
|
||||||
msgid "FAQ"
|
msgid "FAQ"
|
||||||
@@ -459,8 +490,7 @@ msgid ""
|
|||||||
"warm-up to achieve best performance, we recommend preheating the service "
|
"warm-up to achieve best performance, we recommend preheating the service "
|
||||||
"with some requests before conducting performance tests to achieve the "
|
"with some requests before conducting performance tests to achieve the "
|
||||||
"best end-to-end throughput."
|
"best end-to-end throughput."
|
||||||
msgstr ""
|
msgstr "由于部分 NPU 算子的计算需要经过多轮预热才能达到最佳性能,我们建议在进行性能测试前,先用一些请求预热服务,以获得最佳的端到端吞吐量。"
|
||||||
"由于部分 NPU 算子的计算需要经过多轮预热才能达到最佳性能,我们建议在进行性能测试前,先用一些请求预热服务,以获得最佳的端到端吞吐量。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:938
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_multi_node.md:938
|
||||||
msgid "Verification"
|
msgid "Verification"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -24,7 +24,7 @@ msgid "Prefill-Decode Disaggregation (Qwen2.5-VL)"
|
|||||||
msgstr "预填充-解码解耦架构 (Qwen2.5-VL)"
|
msgstr "预填充-解码解耦架构 (Qwen2.5-VL)"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:3
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:3
|
||||||
msgid "Getting Start"
|
msgid "Getting Started"
|
||||||
msgstr "开始使用"
|
msgstr "开始使用"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:5
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:5
|
||||||
@@ -36,10 +36,10 @@ msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南
|
|||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:7
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:7
|
||||||
msgid ""
|
msgid ""
|
||||||
"Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend "
|
"Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend "
|
||||||
"v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "
|
"v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "
|
||||||
"\"1P1D\" architecture. Assume the IP address is 192.0.0.1."
|
"\"1P1D\" architecture. Assume the IP address is 192.0.0.1."
|
||||||
msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vllm-ascend v0.11.0rc1 (包含 vLLM v0.11.0) 部署 \"1P1D\" 架构。假设 IP 地址为 192.0.0.1。"
|
msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vLLM-Ascend v0.11.0rc1 (包含 vLLM v0.11.0) 部署 \"1P1D\" 架构。假设 IP 地址为 192.0.0.1。"
|
||||||
|
|
||||||
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:9
|
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:9
|
||||||
msgid "Verify Communication Environment"
|
msgid "Verify Communication Environment"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -35,7 +35,8 @@ msgid ""
|
|||||||
"language understanding with advanced agentic capabilities, instant and "
|
"language understanding with advanced agentic capabilities, instant and "
|
||||||
"thinking modes, as well as conversational and agentic paradigms."
|
"thinking modes, as well as conversational and agentic paradigms."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"Kimi K2.5 是一个开源的、原生的多模态智能体模型,通过在 Kimi-K2-Base 基础上持续预训练约 15 万亿视觉和文本混合令牌构建而成。它无缝集成了视觉与语言理解能力、先进的智能体能力、即时与思考模式,以及对话式和智能体范式。"
|
"Kimi K2.5 是一个开源的、原生的多模态智能体模型,通过在 Kimi-K2-Base 基础上持续预训练约 15 "
|
||||||
|
"万亿视觉和文本混合令牌构建而成。它无缝集成了视觉与语言理解能力、先进的智能体能力、即时与思考模式,以及对话式和智能体范式。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:7
|
#: ../../source/tutorials/models/Kimi-K2.5.md:7
|
||||||
msgid "The `Kimi-K2.5` model is first supported in `vllm-ascend:v0.17.0rc1`."
|
msgid "The `Kimi-K2.5` model is first supported in `vllm-ascend:v0.17.0rc1`."
|
||||||
@@ -58,7 +59,9 @@ msgid ""
|
|||||||
"Refer to [supported "
|
"Refer to [supported "
|
||||||
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
||||||
" model's supported feature matrix."
|
" model's supported feature matrix."
|
||||||
msgstr "请参考 [支持的特性](../../user_guide/support_matrix/supported_models.md) 获取模型支持的特性矩阵。"
|
msgstr ""
|
||||||
|
"请参考 [支持的特性](../../user_guide/support_matrix/supported_models.md) "
|
||||||
|
"获取模型支持的特性矩阵。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:15
|
#: ../../source/tutorials/models/Kimi-K2.5.md:15
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -78,14 +81,18 @@ msgstr "模型权重"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"`Kimi-K2.5-w4a8`(Quantized version for w4a8): [Download model "
|
"`Kimi-K2.5-w4a8`(Quantized version for w4a8): [Download model "
|
||||||
"weight](https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8)."
|
"weight](https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8)."
|
||||||
msgstr "`Kimi-K2.5-w4a8`(w4a8量化版本):[下载模型权重](https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8)。"
|
msgstr ""
|
||||||
|
"`Kimi-K2.5-w4a8`(w4a8量化版本):[下载模型权重](https://modelscope.cn/models/Eco-"
|
||||||
|
"Tech/Kimi-K2.5-W4A8)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:22
|
#: ../../source/tutorials/models/Kimi-K2.5.md:22
|
||||||
msgid ""
|
msgid ""
|
||||||
"`kimi-k2.5-eagle3`(Eagle3 MTP draft model for accelerating inference of "
|
"`kimi-k2.5-eagle3`(Eagle3 MTP draft model for accelerating inference of "
|
||||||
"Kimi-K2.5): [Download model "
|
"Kimi-K2.5): [Download model "
|
||||||
"weight](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3)"
|
"weight](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3)"
|
||||||
msgstr "`kimi-k2.5-eagle3`(用于加速 Kimi-K2.5 推理的 Eagle3 MTP 草稿模型):[下载模型权重](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3)"
|
msgstr ""
|
||||||
|
"`kimi-k2.5-eagle3`(用于加速 Kimi-K2.5 推理的 Eagle3 MTP "
|
||||||
|
"草稿模型):[下载模型权重](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:24
|
#: ../../source/tutorials/models/Kimi-K2.5.md:24
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -102,7 +109,9 @@ msgid ""
|
|||||||
"If you want to deploy multi-node environment, you need to verify multi-"
|
"If you want to deploy multi-node environment, you need to verify multi-"
|
||||||
"node communication according to [verify multi-node communication "
|
"node communication according to [verify multi-node communication "
|
||||||
"environment](../../installation.md#verify-multi-node-communication)."
|
"environment](../../installation.md#verify-multi-node-communication)."
|
||||||
msgstr "如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation.md#verify-multi-node-communication) 验证多节点通信。"
|
msgstr ""
|
||||||
|
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||||
|
"communication) 验证多节点通信。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:30
|
#: ../../source/tutorials/models/Kimi-K2.5.md:30
|
||||||
msgid "Installation"
|
msgid "Installation"
|
||||||
@@ -117,21 +126,26 @@ msgid ""
|
|||||||
"Select an image based on your machine type and start the docker image on "
|
"Select an image based on your machine type and start the docker image on "
|
||||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||||
"docker)."
|
"docker)."
|
||||||
msgstr "根据您的机器类型选择镜像,并在节点上启动 docker 镜像,请参考 [使用 docker](../../installation.md#set-up-using-docker)。"
|
msgstr ""
|
||||||
|
"根据您的机器类型选择镜像,并在节点上启动 docker 镜像,请参考 [使用 docker](../../installation.md#set-"
|
||||||
|
"up-using-docker)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md
|
#: ../../source/tutorials/models/Kimi-K2.5.md:36
|
||||||
msgid "A3 series"
|
msgid "A3 series"
|
||||||
msgstr "A3 系列"
|
msgstr "A3 系列"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:43
|
#: ../../source/tutorials/models/Kimi-K2.5.md:43
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:86
|
|
||||||
msgid "Start the docker image on your each node."
|
msgid "Start the docker image on your each node."
|
||||||
msgstr "在您的每个节点上启动 docker 镜像。"
|
msgstr "在您的每个节点上启动 docker 镜像。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md
|
#: ../../source/tutorials/models/Kimi-K2.5.md:45
|
||||||
msgid "A2 series"
|
msgid "A2 series"
|
||||||
msgstr "A2 系列"
|
msgstr "A2 系列"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/models/Kimi-K2.5.md:86
|
||||||
|
msgid "Start the docker image on your each node."
|
||||||
|
msgstr "在您的每个节点上启动 docker 镜像。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:119
|
#: ../../source/tutorials/models/Kimi-K2.5.md:119
|
||||||
msgid ""
|
msgid ""
|
||||||
"In addition, if you don't want to use the docker image as above, you can "
|
"In addition, if you don't want to use the docker image as above, you can "
|
||||||
@@ -169,7 +183,6 @@ msgid "Run the following script to execute online inference."
|
|||||||
msgstr "运行以下脚本执行在线推理。"
|
msgstr "运行以下脚本执行在线推理。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:176
|
#: ../../source/tutorials/models/Kimi-K2.5.md:176
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:645
|
|
||||||
msgid "**Notice:** The parameters are explained as follows:"
|
msgid "**Notice:** The parameters are explained as follows:"
|
||||||
msgstr "**注意:** 参数解释如下:"
|
msgstr "**注意:** 参数解释如下:"
|
||||||
|
|
||||||
@@ -180,7 +193,9 @@ msgid ""
|
|||||||
"reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios."
|
"reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios."
|
||||||
" Furthermore, enabling this feature is not recommended in scenarios where"
|
" Furthermore, enabling this feature is not recommended in scenarios where"
|
||||||
" PD is separated."
|
" PD is separated."
|
||||||
msgstr "设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于提高 v1 调度器中的输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下降。此外,在 PD 分离的场景中不建议启用此功能。"
|
msgstr ""
|
||||||
|
"设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于提高 v1 "
|
||||||
|
"调度器中的输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下降。此外,在 PD 分离的场景中不建议启用此功能。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:180
|
#: ../../source/tutorials/models/Kimi-K2.5.md:180
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -195,7 +210,9 @@ msgid ""
|
|||||||
" with an input length of 3.5K and output length of 1.5K, a value of "
|
" with an input length of 3.5K and output length of 1.5K, a value of "
|
||||||
"`16384` is sufficient, however, for precision testing, please set it at "
|
"`16384` is sufficient, however, for precision testing, please set it at "
|
||||||
"least `35000`."
|
"least `35000`."
|
||||||
msgstr "`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌总数。对于输入长度 3.5K 和输出长度 1.5K 的性能测试,`16384` 的值就足够了,但对于精度测试,请至少将其设置为 `35000`。"
|
msgstr ""
|
||||||
|
"`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌总数。对于输入长度 3.5K 和输出长度 1.5K "
|
||||||
|
"的性能测试,`16384` 的值就足够了,但对于精度测试,请至少将其设置为 `35000`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:182
|
#: ../../source/tutorials/models/Kimi-K2.5.md:182
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -244,14 +261,18 @@ msgstr "Prefill-Decode 分离"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"We recommend using Mooncake for deployment: "
|
"We recommend using Mooncake for deployment: "
|
||||||
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
|
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
|
||||||
msgstr "我们建议使用 Mooncake 进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
|
msgstr ""
|
||||||
|
"我们建议使用 Mooncake "
|
||||||
|
"进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:326
|
#: ../../source/tutorials/models/Kimi-K2.5.md:326
|
||||||
msgid ""
|
msgid ""
|
||||||
"Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 "
|
"Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 "
|
||||||
"nodes) rather than 1P1D (2 nodes), because there is no enough NPU memory "
|
"nodes) rather than 1P1D (2 nodes), because there is no enough NPU memory "
|
||||||
"to serve high concurrency in 1P1D case."
|
"to serve high concurrency in 1P1D case."
|
||||||
msgstr "以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D(2 个节点),因为在 1P1D 情况下没有足够的 NPU 内存来服务高并发。"
|
msgstr ""
|
||||||
|
"以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D(2 个节点),因为在 1P1D "
|
||||||
|
"情况下没有足够的 NPU 内存来服务高并发。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:328
|
#: ../../source/tutorials/models/Kimi-K2.5.md:328
|
||||||
msgid "`Kimi-K2.5-w4a8 2P1D` require 4 Atlas 800 A3 (64G × 16)."
|
msgid "`Kimi-K2.5-w4a8 2P1D` require 4 Atlas 800 A3 (64G × 16)."
|
||||||
@@ -263,14 +284,20 @@ msgid ""
|
|||||||
"to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` "
|
"to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` "
|
||||||
"script on each node and deploy a `proxy.sh` script on prefill master node"
|
"script on each node and deploy a `proxy.sh` script on prefill master node"
|
||||||
" to forward requests."
|
" to forward requests."
|
||||||
msgstr "要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署一个 `launch_dp_program.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill 主节点上部署一个 `proxy.sh` 脚本来转发请求。"
|
msgstr ""
|
||||||
|
"要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署一个 "
|
||||||
|
"`launch_dp_program.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill 主节点上部署一个 "
|
||||||
|
"`proxy.sh` 脚本来转发请求。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:332
|
#: ../../source/tutorials/models/Kimi-K2.5.md:332
|
||||||
msgid ""
|
msgid ""
|
||||||
"`launch_online_dp.py` to launch external dp vllm servers. "
|
"`launch_online_dp.py` to launch external dp vllm servers. "
|
||||||
"[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
"[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
||||||
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||||
msgstr "`launch_online_dp.py` 用于启动外部 dp vllm 服务器。[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
msgstr ""
|
||||||
|
"`launch_online_dp.py` 用于启动外部 dp vllm "
|
||||||
|
"服务器。[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
||||||
|
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:335
|
#: ../../source/tutorials/models/Kimi-K2.5.md:335
|
||||||
msgid "Prefill Node 0 `run_dp_template.sh` script"
|
msgid "Prefill Node 0 `run_dp_template.sh` script"
|
||||||
@@ -288,6 +315,10 @@ msgstr "Decode 节点 0 `run_dp_template.sh` 脚本"
|
|||||||
msgid "Decode Node 1 `run_dp_template.sh` script"
|
msgid "Decode Node 1 `run_dp_template.sh` script"
|
||||||
msgstr "Decode 节点 1 `run_dp_template.sh` 脚本"
|
msgstr "Decode 节点 1 `run_dp_template.sh` 脚本"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/models/Kimi-K2.5.md:645
|
||||||
|
msgid "**Notice:** The parameters are explained as follows:"
|
||||||
|
msgstr "**注意:** 参数解释如下:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:648
|
#: ../../source/tutorials/models/Kimi-K2.5.md:648
|
||||||
msgid ""
|
msgid ""
|
||||||
"`VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization"
|
"`VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization"
|
||||||
@@ -300,7 +331,9 @@ msgid ""
|
|||||||
"significantly improve performance but consumes more NPU memory. In the "
|
"significantly improve performance but consumes more NPU memory. In the "
|
||||||
"Prefill-Decode (PD) separation scenario, enable MLAPO only on decode "
|
"Prefill-Decode (PD) separation scenario, enable MLAPO only on decode "
|
||||||
"nodes."
|
"nodes."
|
||||||
msgstr "`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 NPU 内存。在 Prefill-Decode(PD)分离场景中,仅在 decode 节点上启用 MLAPO。"
|
msgstr ""
|
||||||
|
"`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 NPU 内存。在 Prefill-"
|
||||||
|
"Decode(PD)分离场景中,仅在 decode 节点上启用 MLAPO。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:650
|
#: ../../source/tutorials/models/Kimi-K2.5.md:650
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -316,7 +349,9 @@ msgid ""
|
|||||||
"the min is `n = 1` and the max is `n = max-num-seqs`. For other values, "
|
"the min is `n = 1` and the max is `n = max-num-seqs`. For other values, "
|
||||||
"it is recommended to set them to the number of frequently occurring "
|
"it is recommended to set them to the number of frequently occurring "
|
||||||
"requests on the Decode (D) node."
|
"requests on the Decode (D) node."
|
||||||
msgstr "`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大值为 `n = max-num-seqs`。对于其他值,建议将其设置为 Decode(D)节点上频繁出现的请求数量。"
|
msgstr ""
|
||||||
|
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大值为 `n = "
|
||||||
|
"max-num-seqs`。对于其他值,建议将其设置为 Decode(D)节点上频繁出现的请求数量。"
|
||||||
|
|
||||||
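Review note: the `cudagraph_capture_sizes` rule in the entry above (`n x (mtp + 1)`, with `n` running from 1 to `max-num-seqs`) can be sketched as a small helper. This is a hedged illustration, not vllm-ascend code: the function name and the `frequent_batches` parameter are hypothetical, and `mtp` is read here as the number of speculative tokens per step, which the msgid does not spell out.

```python
def capture_sizes(max_num_seqs: int, mtp: int, frequent_batches: list[int]) -> list[int]:
    """Build a candidate cudagraph_capture_sizes list.

    Each size is n * (mtp + 1). n always includes the minimum (1) and the
    maximum (max_num_seqs); other n values come from frequently observed
    request counts on the decode node, supplied by the caller.
    """
    tokens_per_seq = mtp + 1
    ns = {1, max_num_seqs}
    # Ignore observed counts that fall outside the valid 1..max_num_seqs range.
    ns.update(n for n in frequent_batches if 1 <= n <= max_num_seqs)
    return [n * tokens_per_seq for n in sorted(ns)]


if __name__ == "__main__":
    # Hypothetical workload: up to 16 sequences, mtp = 3, common batches 4 and 8.
    print(capture_sizes(max_num_seqs=16, mtp=3, frequent_batches=[4, 8, 32]))
```

The out-of-range count (32) is dropped rather than clamped, so the list never exceeds the documented maximum of `max-num-seqs x (mtp + 1)`.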
#: ../../source/tutorials/models/Kimi-K2.5.md:652
|
#: ../../source/tutorials/models/Kimi-K2.5.md:652
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -325,7 +360,8 @@ msgid ""
|
|||||||
"requests will be sent to the prefill node to recompute the KV Cache. In "
|
"requests will be sent to the prefill node to recompute the KV Cache. In "
|
||||||
"the PD separation scenario, it is recommended to enable this "
|
"the PD separation scenario, it is recommended to enable this "
|
||||||
"configuration on both prefill and decode nodes simultaneously."
|
"configuration on both prefill and decode nodes simultaneously."
|
||||||
msgstr "`recompute_scheduler_enable: true`:启用重计算调度器。当 decode 节点的键值缓存(KV Cache)不足时,请求将被发送到 prefill 节点以重新计算 KV Cache。在 PD 分离场景中,建议同时在 prefill 和 decode 节点上启用此配置。"
|
msgstr ""
|
||||||
|
"`recompute_scheduler_enable: true`:启用重计算调度器。当解码节点的键值缓存(KV Cache)不足时,请求将被发送到预填充节点以重新计算 KV Cache。在 PD 分离场景中,建议同时在预填充和解码节点上启用此配置。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:653
|
#: ../../source/tutorials/models/Kimi-K2.5.md:653
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -333,7 +369,8 @@ msgid ""
|
|||||||
"(TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream "
|
"(TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream "
|
||||||
"is enabled to overlap the computation process of shared experts for "
|
"is enabled to overlap the computation process of shared experts for "
|
||||||
"improved efficiency."
|
"improved efficiency."
|
||||||
msgstr "`multistream_overlap_shared_expert: true`:当张量并行(TP)大小为 1 或 `enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程以提高效率。"
|
msgstr ""
|
||||||
|
"`multistream_overlap_shared_expert: true`:当张量并行(TP)大小为 1 或 `enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程以提高效率。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:655
|
#: ../../source/tutorials/models/Kimi-K2.5.md:655
|
||||||
msgid "run server for each node:"
|
msgid "run server for each node:"
|
||||||
@@ -341,7 +378,7 @@ msgstr "为每个节点运行服务器:"
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:668
|
#: ../../source/tutorials/models/Kimi-K2.5.md:668
|
||||||
msgid "Run the `proxy.sh` script on the prefill master node"
|
msgid "Run the `proxy.sh` script on the prefill master node"
|
||||||
msgstr "在 prefill 主节点上运行 `proxy.sh` 脚本"
|
msgstr "在预填充主节点上运行 `proxy.sh` 脚本"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:670
|
#: ../../source/tutorials/models/Kimi-K2.5.md:670
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -350,7 +387,8 @@ msgid ""
|
|||||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
||||||
"project/vllm-"
|
"project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr "在与 prefiller 服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
msgstr ""
|
||||||
|
"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:726
|
#: ../../source/tutorials/models/Kimi-K2.5.md:726
|
||||||
msgid "Functional Verification"
|
msgid "Functional Verification"
|
||||||
@@ -567,8 +605,8 @@ msgid ""
|
|||||||
msgstr "**问:启动失败,提示 HCCL 端口冲突(地址已被占用)。我该怎么办?**"
|
msgstr "**问:启动失败,提示 HCCL 端口冲突(地址已被占用)。我该怎么办?**"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:812
|
#: ../../source/tutorials/models/Kimi-K2.5.md:812
|
||||||
msgid "A: Clean up old processes and restart: `pkill -f VLLM*`."
|
msgid "A: Clean up old processes and restart: `pkill -f vLLM*`."
|
||||||
msgstr "答:清理旧进程并重启:`pkill -f VLLM*`。"
|
msgstr "答:清理旧进程并重启:`pkill -f vLLM*`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Kimi-K2.5.md:814
|
#: ../../source/tutorials/models/Kimi-K2.5.md:814
|
||||||
msgid "**Q: How to handle OOM or unstable startup?**"
|
msgid "**Q: How to handle OOM or unstable startup?**"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -42,7 +42,9 @@ msgid ""
|
|||||||
"including supported features, feature configuration, environment "
|
"including supported features, feature configuration, environment "
|
||||||
"preparation, single-NPU and multi-NPU deployment, accuracy and "
|
"preparation, single-NPU and multi-NPU deployment, accuracy and "
|
||||||
"performance evaluation."
|
"performance evaluation."
|
||||||
msgstr "`Qwen2.5-Omni` 模型自 `vllm-ascend:v0.11.0rc0` 版本起获得支持。本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单NPU和多NPU部署、精度和性能评估。"
|
msgstr ""
|
||||||
|
"`Qwen2.5-Omni` 模型自 `vllm-ascend:v0.11.0rc0` "
|
||||||
|
"版本起获得支持。本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单NPU和多NPU部署、精度和性能评估。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:9
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:9
|
||||||
msgid "Supported Features"
|
msgid "Supported Features"
|
||||||
@@ -73,13 +75,17 @@ msgstr "模型权重"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"`Qwen2.5-Omni-3B`(BF16): [Download model "
|
"`Qwen2.5-Omni-3B`(BF16): [Download model "
|
||||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)"
|
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)"
|
||||||
msgstr "`Qwen2.5-Omni-3B`(BF16): [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)"
|
msgstr ""
|
||||||
|
"`Qwen2.5-Omni-3B`(BF16): "
|
||||||
|
"[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:20
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:20
|
||||||
msgid ""
|
msgid ""
|
||||||
"`Qwen2.5-Omni-7B`(BF16): [Download model "
|
"`Qwen2.5-Omni-7B`(BF16): [Download model "
|
||||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)"
|
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)"
|
||||||
msgstr "`Qwen2.5-Omni-7B`(BF16): [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)"
|
msgstr ""
|
||||||
|
"`Qwen2.5-Omni-7B`(BF16): "
|
||||||
|
"[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:22
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:22
|
||||||
msgid "Following examples use the 7B version by default."
|
msgid "Following examples use the 7B version by default."
|
||||||
@@ -98,7 +104,9 @@ msgid ""
|
|||||||
"Select an image based on your machine type and start the docker image on "
|
"Select an image based on your machine type and start the docker image on "
|
||||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||||
"docker)."
|
"docker)."
|
||||||
msgstr "根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-docker)。"
|
msgstr ""
|
||||||
|
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-"
|
||||||
|
"up-using-docker)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:65
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:65
|
||||||
msgid "Deployment"
|
msgid "Deployment"
|
||||||
@@ -114,18 +122,22 @@ msgstr "单 NPU (Qwen2.5-Omni-7B)"
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:72
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:72
|
||||||
msgid ""
|
msgid ""
|
||||||
"The **environment variable** `LOCAL_MEDIA_PATH` which **allows** API "
|
"The environment variable `LOCAL_MEDIA_PATH` which allows API requests to "
|
||||||
"requests to read local images or videos from directories specified by the"
|
"read local images or videos from directories specified by the server file"
|
||||||
" server file system. Please note this is a security risk. Should only be "
|
" system. Please note this is a security risk. Should only be enabled in "
|
||||||
"enabled in trusted environments."
|
"trusted environments."
|
||||||
msgstr "**环境变量** `LOCAL_MEDIA_PATH` **允许** API 请求从服务器文件系统指定的目录读取本地图像或视频。请注意,这存在安全风险。应仅在受信任的环境中启用。"
|
msgstr ""
|
||||||
|
"环境变量 `LOCAL_MEDIA_PATH` 允许 API "
|
||||||
|
"请求从服务器文件系统指定的目录读取本地图像或视频。请注意,这存在安全风险。应仅在受信任的环境中启用。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:92
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:92
|
||||||
msgid ""
|
msgid ""
|
||||||
"Now vllm-ascend docker image should contain vllm[audio] build part, if "
|
"Now vllm-ascend docker image should contain vllm[audio] build part, if "
|
||||||
"you encounter *audio not supported issue* by any chance, please re-build "
|
"you encounter *audio not supported issue* by any chance, please re-build "
|
||||||
"vllm with [audio] flag."
|
"vllm with [audio] flag."
|
||||||
msgstr "当前 vllm-ascend docker 镜像应包含 vllm[audio] 构建部分,如果您遇到*音频不支持的问题*,请使用 [audio] 标志重新构建 vllm。"
|
msgstr ""
|
||||||
|
"当前 vllm-ascend docker 镜像应包含 vllm[audio] 构建部分,如果您遇到*音频不支持的问题*,请使用 [audio] "
|
||||||
|
"标志重新构建 vllm。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:100
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:100
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -162,8 +174,8 @@ msgid "Functional Verification"
|
|||||||
msgstr "功能验证"
|
msgstr "功能验证"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:131
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:131
|
||||||
msgid "If your service **starts** successfully, you can see the info shown below:"
|
msgid "If your service starts successfully, you can see the info shown below:"
|
||||||
msgstr "如果您的服务**启动**成功,您可以看到如下所示的信息:"
|
msgstr "如果您的服务启动成功,您可以看到如下所示的信息:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:139
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:139
|
||||||
msgid "Once your server is started, you can query the model with input prompts:"
|
msgid "Once your server is started, you can query the model with input prompts:"
|
||||||
@@ -258,7 +270,10 @@ msgid ""
|
|||||||
"Refer to [Using AISBench for performance "
|
"Refer to [Using AISBench for performance "
|
||||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
"performance-evaluation) for details."
|
"performance-evaluation) for details."
|
||||||
msgstr "详情请参考[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
msgstr ""
|
||||||
|
"详情请参考[使用 AISBench "
|
||||||
|
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
|
"performance-evaluation)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen2.5-Omni.md:194
|
#: ../../source/tutorials/models/Qwen2.5-Omni.md:194
|
||||||
msgid "Using vLLM Benchmark"
|
msgid "Using vLLM Benchmark"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -35,9 +35,8 @@ msgid ""
|
|||||||
"advancements in reasoning, instruction-following, agent capabilities, and"
|
"advancements in reasoning, instruction-following, agent capabilities, and"
|
||||||
" multilingual support."
|
" multilingual support."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合"
|
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合(MoE) 模型。基于广泛的训练,Qwen3 "
|
||||||
"(MoE) 模型。基于广泛的训练,Qwen3 在推理、指令遵循、智能体能力和多语言支持方"
|
"在推理、指令遵循、智能体能力和多语言支持方面实现了突破性进展。"
|
||||||
"面实现了突破性进展。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:7
|
#: ../../source/tutorials/models/Qwen3-Dense.md:7
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -47,18 +46,15 @@ msgid ""
|
|||||||
"optimization points. We will also explore how adjusting service "
|
"optimization points. We will also explore how adjusting service "
|
||||||
"parameters can maximize throughput performance across various scenarios."
|
"parameters can maximize throughput performance across various scenarios."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"欢迎阅读在 vLLM-Ascend 环境中优化 Qwen 稠密模型的教程。本指南将帮助您为您的用"
|
"欢迎阅读在 vLLM-Ascend 环境中优化 Qwen "
|
||||||
"例配置最有效的设置,并通过实际示例突出关键优化点。我们还将探讨如何调整服务参"
|
"稠密模型的教程。本指南将帮助您为您的用例配置最有效的设置,并通过实际示例突出关键优化点。我们还将探讨如何调整服务参数以在各种场景下最大化吞吐性能。"
|
||||||
"数以在各种场景下最大化吞吐性能。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:9
|
#: ../../source/tutorials/models/Qwen3-Dense.md:9
|
||||||
msgid ""
|
msgid ""
|
||||||
"This document will show the main verification steps of the model, "
|
"This document will show the main verification steps of the model, "
|
||||||
"including supported features, feature configuration, environment "
|
"including supported features, feature configuration, environment "
|
||||||
"preparation, accuracy and performance evaluation."
|
"preparation, accuracy and performance evaluation."
|
||||||
msgstr ""
|
msgstr "本文档将展示模型的主要验证步骤,包括支持的特性、特性配置、环境准备、精度和性能评估。"
|
||||||
"本文档将展示模型的主要验证步骤,包括支持的特性、特性配置、环境准备、精度和性"
|
|
||||||
"能评估。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:11
|
#: ../../source/tutorials/models/Qwen3-Dense.md:11
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -68,11 +64,9 @@ msgid ""
|
|||||||
"20250429). This example requires version **v0.11.0rc2**. Earlier versions"
|
"20250429). This example requires version **v0.11.0rc2**. Earlier versions"
|
||||||
" may lack certain features."
|
" may lack certain features."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"Qwen3 稠密模型首次在 "
|
"Qwen3 稠密模型首次在 [v0.8.4rc2](https://github.com/vllm-project/vllm-"
|
||||||
"[v0.8.4rc2](https://github.com/vllm-project/vllm-"
|
|
||||||
"ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---"
|
"ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---"
|
||||||
"20250429) 中得到支持。本示例需要版本 **v0.11.0rc2**。更早的版本可能缺少某些特"
|
"20250429) 中得到支持。本示例需要版本 **v0.11.0rc2**。更早的版本可能缺少某些特性。"
|
||||||
"性。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:13
|
#: ../../source/tutorials/models/Qwen3-Dense.md:13
|
||||||
msgid "Supported Features"
|
msgid "Supported Features"
|
||||||
@@ -84,16 +78,14 @@ msgid ""
|
|||||||
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
||||||
" model's supported feature matrix."
|
" model's supported feature matrix."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"请参考 [支持的特性](../../user_guide/support_matrix/supported_models."
|
"请参考 [支持的特性](../../user_guide/support_matrix/supported_models.md) "
|
||||||
"md) 以获取模型支持的特性矩阵。"
|
"以获取模型支持的特性矩阵。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:17
|
#: ../../source/tutorials/models/Qwen3-Dense.md:17
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
|
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
|
||||||
"the feature's configuration."
|
"the feature's configuration."
|
||||||
msgstr ""
|
msgstr "请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配置信息。"
|
||||||
"请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配置信"
|
|
||||||
"息。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:19
|
#: ../../source/tutorials/models/Qwen3-Dense.md:19
|
||||||
msgid "Environment Preparation"
|
msgid "Environment Preparation"
|
||||||
@@ -109,9 +101,9 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 1) card. [Download model "
|
"Atlas 800I A2 (64G × 1) card. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-0.6B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-0.6B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-0.6B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas "
|
"`Qwen3-0.6B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas 800I A2"
|
||||||
"800I A2 (64G × 1) 卡。[下载模型权重](https://modelers.cn/models/"
|
" (64G × 1) "
|
||||||
"Modelers_Park/Qwen3-0.6B)"
|
"卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-0.6B)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:24
|
#: ../../source/tutorials/models/Qwen3-Dense.md:24
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -119,9 +111,9 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 1) card. [Download model "
|
"Atlas 800I A2 (64G × 1) card. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-1.7B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-1.7B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-1.7B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas "
|
"`Qwen3-1.7B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas 800I A2"
|
||||||
"800I A2 (64G × 1) 卡。[下载模型权重](https://modelers.cn/models/"
|
" (64G × 1) "
|
||||||
"Modelers_Park/Qwen3-1.7B)"
|
"卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-1.7B)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:25
|
#: ../../source/tutorials/models/Qwen3-Dense.md:25
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -129,9 +121,8 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 1) card. [Download model "
|
"Atlas 800I A2 (64G × 1) card. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-4B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas "
|
"`Qwen3-4B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas 800I A2 "
|
||||||
"800I A2 (64G × 1) 卡。[下载模型权重](https://modelers.cn/models/"
|
"(64G × 1) 卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-4B)"
|
||||||
"Modelers_Park/Qwen3-4B)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:26
|
#: ../../source/tutorials/models/Qwen3-Dense.md:26
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -139,9 +130,8 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 1) card. [Download model "
|
"Atlas 800I A2 (64G × 1) card. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-8B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas "
|
"`Qwen3-8B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 1 张 Atlas 800I A2 "
|
||||||
"800I A2 (64G × 1) 卡。[下载模型权重](https://modelers.cn/models/"
|
"(64G × 1) 卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-8B)"
|
||||||
"Modelers_Park/Qwen3-8B)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:27
|
#: ../../source/tutorials/models/Qwen3-Dense.md:27
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -149,9 +139,8 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 1) cards. [Download model "
|
"Atlas 800I A2 (64G × 1) cards. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-14B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 2 张 Atlas "
|
"`Qwen3-14B`(BF16 版本): 需要 1 张 Atlas 800 A3 (64G × 2) 卡或 2 张 Atlas 800I A2 "
|
||||||
"800I A2 (64G × 1) 卡。[下载模型权重](https://modelers.cn/models/"
|
"(64G × 1) 卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-14B)"
|
||||||
"Modelers_Park/Qwen3-14B)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:28
|
#: ../../source/tutorials/models/Qwen3-Dense.md:28
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -159,9 +148,8 @@ msgid ""
|
|||||||
"Atlas 800I A2 (64G × 4) cards. [Download model "
|
"Atlas 800I A2 (64G × 4) cards. [Download model "
|
||||||
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B)"
|
"weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-32B`(BF16 版本): 需要 2 张 Atlas 800 A3 (64G × 4) 卡或 4 张 Atlas "
|
"`Qwen3-32B`(BF16 版本): 需要 2 张 Atlas 800 A3 (64G × 4) 卡或 4 张 Atlas 800I A2 "
|
||||||
"800I A2 (64G × 4) 卡。[下载模型权重](https://modelers.cn/models/"
|
"(64G × 4) 卡。[下载模型权重](https://modelers.cn/models/Modelers_Park/Qwen3-32B)"
|
||||||
"Modelers_Park/Qwen3-32B)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:29
|
#: ../../source/tutorials/models/Qwen3-Dense.md:29
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -169,9 +157,9 @@ msgid ""
|
|||||||
"cards or 4 Atlas 800I A2 (64G × 4) cards. [Download model "
|
"cards or 4 Atlas 800I A2 (64G × 4) cards. [Download model "
|
||||||
"weight](https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W8A8)"
|
"weight](https://www.modelscope.cn/models/vllm-ascend/Qwen3-32B-W8A8)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-32B-W8A8`(量化版本): 需要 2 张 Atlas 800 A3 (64G × 4) 卡或 4 张 "
|
"`Qwen3-32B-W8A8`(量化版本): 需要 2 张 Atlas 800 A3 (64G × 4) 卡或 4 张 Atlas 800I "
|
||||||
"Atlas 800I A2 (64G × 4) 卡。[下载模型权重](https://www.modelscope.cn/"
|
"A2 (64G × 4) 卡。[下载模型权重](https://www.modelscope.cn/models/vllm-"
|
||||||
"models/vllm-ascend/Qwen3-32B-W8A8)"
|
"ascend/Qwen3-32B-W8A8)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:31
|
#: ../../source/tutorials/models/Qwen3-Dense.md:31
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -195,8 +183,8 @@ msgid ""
|
|||||||
"node communication according to [verify multi-node communication "
|
"node communication according to [verify multi-node communication "
|
||||||
"environment](../../installation.md#verify-multi-node-communication)."
|
"environment](../../installation.md#verify-multi-node-communication)."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation."
|
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||||
"md#verify-multi-node-communication) 来验证多节点通信。"
|
"communication) 来验证多节点通信。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:39
|
#: ../../source/tutorials/models/Qwen3-Dense.md:39
|
||||||
msgid "Installation"
|
msgid "Installation"
|
||||||
@@ -208,8 +196,9 @@ msgid ""
|
|||||||
"Currently, we provide the all-in-one images.[Download "
|
"Currently, we provide the all-in-one images.[Download "
|
||||||
"images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)"
|
"images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"您可以使用我们的官方 docker 镜像来支持 Qwen3 稠密模型。目前,我们提供一体化镜"
|
"您可以使用我们的官方 docker 镜像来支持 Qwen3 "
|
||||||
"像。[下载镜像](https://quay.io/repository/ascend/vllm-ascend?tab=tags)"
|
"稠密模型。目前,我们提供一体化镜像。[下载镜像](https://quay.io/repository/ascend/vllm-"
|
||||||
|
"ascend?tab=tags)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:44
|
#: ../../source/tutorials/models/Qwen3-Dense.md:44
|
||||||
msgid "Docker Pull (by tag)"
|
msgid "Docker Pull (by tag)"
|
||||||
@@ -227,18 +216,15 @@ msgid ""
|
|||||||
" (`pip install -e`) to help developer immediately take place changes "
|
" (`pip install -e`) to help developer immediately take place changes "
|
||||||
"without requiring a new installation."
|
"without requiring a new installation."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"默认工作目录是 `/workspace`,vLLM 和 vLLM Ascend 代码放置在 `/vllm-"
|
"默认工作目录是 `/workspace`,vLLM 和 vLLM Ascend 代码放置在 `/vllm-workspace` 中,并以 "
|
||||||
"workspace` 中,并以 [开发模式](https://setuptools.pypa.io/en/latest/"
|
"[开发模式](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)"
|
||||||
"userguide/development_mode.html) (`pip install -e`) 安装,以帮助开发者立即应用"
|
" (`pip install -e`) 安装,以帮助开发者立即应用更改而无需重新安装。"
|
||||||
"更改而无需重新安装。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:92
|
#: ../../source/tutorials/models/Qwen3-Dense.md:92
|
||||||
msgid ""
|
msgid ""
|
||||||
"In the [Run docker container](./Qwen3-Dense.md#run-docker-container), "
|
"In the [Run docker container](./Qwen3-Dense.md#run-docker-container), "
|
||||||
"detailed explanations are provided through specific examples."
|
"detailed explanations are provided through specific examples."
|
||||||
msgstr ""
|
msgstr "在 [运行 docker 容器](./Qwen3-Dense.md#run-docker-container) 中,通过具体示例提供了详细说明。"
|
||||||
"在 [运行 docker 容器](./Qwen3-Dense.md#run-docker-container) 中,通过具体示例"
|
|
||||||
"提供了详细说明。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:94
|
#: ../../source/tutorials/models/Qwen3-Dense.md:94
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -273,11 +259,10 @@ msgid ""
|
|||||||
"max_num_batched_tokens, and cudagraph_capture_sizes, to achieve the best "
|
"max_num_batched_tokens, and cudagraph_capture_sizes, to achieve the best "
|
||||||
"performance."
|
"performance."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"在本节中,我们将演示在 vLLM-Ascend 中调整超参数以实现最大推理吞吐性能的最佳实"
|
"在本节中,我们将演示在 vLLM-Ascend "
|
||||||
"践。通过定制服务级配置以适应不同的用例,您可以确保您的系统在各种场景下都能达"
|
"中调整超参数以实现最大推理吞吐性能的最佳实践。通过定制服务级配置以适应不同的用例,您可以确保您的系统在各种场景下都能达到最佳性能。我们将指导您如何根据观察到的现象(例如"
|
||||||
"到最佳性能。我们将指导您如何根据观察到的现象(例如 max_model_len、"
|
" max_model_len、max_num_batched_tokens 和 "
|
||||||
"max_num_batched_tokens 和 cudagraph_capture_sizes)来微调超参数,以获得最佳性"
|
"cudagraph_capture_sizes)来微调超参数,以获得最佳性能。"
|
||||||
"能。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:104
|
#: ../../source/tutorials/models/Qwen3-Dense.md:104
|
||||||
msgid "The specific example scenario is as follows:"
|
msgid "The specific example scenario is as follows:"
|
||||||
@@ -364,11 +349,9 @@ msgid ""
|
|||||||
" these scenarios and this parameter will be removed."
|
" these scenarios and this parameter will be removed."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**[可选]** `--additional-config '{\"pa_shape_list\":[48,64,72,80]}'`: "
|
"**[可选]** `--additional-config '{\"pa_shape_list\":[48,64,72,80]}'`: "
|
||||||
"`pa_shape_list` 指定了您希望切换到 PA 算子的批次大小。这是一个临时的调优旋"
|
"`pa_shape_list` 指定了您希望切换到 PA 算子的批次大小。这是一个临时的调优旋钮。目前,注意力算子调度默认使用 FIA "
|
||||||
"钮。目前,注意力算子调度默认使用 FIA 算子。在某些批次大小(并发)设置下,FIA "
|
"算子。在某些批次大小(并发)设置下,FIA 可能性能不佳。通过设置 `pa_shape_list`,当运行时批次大小与列出的值之一匹配时"
|
||||||
"可能性能不佳。通过设置 `pa_shape_list`,当运行时批次大小与列出的值之一匹配时,"
|
",vLLM-Ascend 将用 PA 算子替换 FIA 算子以防止性能下降。未来,FIA 将针对这些场景进行优化,此参数将被移除。"
|
||||||
"vLLM-Ascend 将用 PA 算子替换 FIA 算子以防止性能下降。未来,FIA 将针对这些场景"
|
|
||||||
"进行优化,此参数将被移除。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:198
|
#: ../../source/tutorials/models/Qwen3-Dense.md:198
|
||||||
#, python-brace-format
|
#, python-brace-format
|
||||||
@@ -381,10 +364,10 @@ msgid ""
|
|||||||
"\"FULL_DECODE_ONLY\", "
|
"\"FULL_DECODE_ONLY\", "
|
||||||
"\"cudagraph_capture_sizes\":[1,8,24,48,60,64,72,76]}'`."
|
"\"cudagraph_capture_sizes\":[1,8,24,48,60,64,72,76]}'`."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"如果需要极致性能,可以启用 cudagraph_capture_sizes 参数,参考:[关键优化"
|
"如果需要极致性能,可以启用 cudagraph_capture_sizes 参数,参考:[关键优化点](./Qwen3-Dense.md#key-"
|
||||||
"点](./Qwen3-Dense.md#key-optimization-points)、[优化亮点](./Qwen3-"
|
"optimization-points)、[优化亮点](./Qwen3-Dense.md#optimization-"
|
||||||
"Dense.md#optimization-highlights)。以下是批次大小为 72 的示例:`--compilation-"
|
"highlights)。以下是批次大小为 72 的示例:`--compilation-config '{\"cudagraph_mode\": "
|
||||||
"config '{\"cudagraph_mode\": \"FULL_DECODE_ONLY\", "
|
"\"FULL_DECODE_ONLY\", "
|
||||||
"\"cudagraph_capture_sizes\":[1,8,24,48,60,64,72,76]}'`。"
|
"\"cudagraph_capture_sizes\":[1,8,24,48,60,64,72,76]}'`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:201
|
#: ../../source/tutorials/models/Qwen3-Dense.md:201
|
||||||
@@ -423,7 +406,7 @@ msgid ""
|
|||||||
"Refer to [Using "
|
"Refer to [Using "
|
||||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||||
"details."
|
"details."
|
||||||
msgstr "详情请参阅[使用AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:273
|
#: ../../source/tutorials/models/Qwen3-Dense.md:273
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -512,11 +495,13 @@ msgid ""
|
|||||||
"Refer to [Using AISBench for performance "
|
"Refer to [Using AISBench for performance "
|
||||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
"performance-evaluation) for details."
|
"performance-evaluation) for details."
|
||||||
msgstr "详情请参阅[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
msgstr ""
|
||||||
|
"详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md"
|
||||||
|
"#execute-performance-evaluation)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:287
|
#: ../../source/tutorials/models/Qwen3-Dense.md:287
|
||||||
msgid "Using vLLM Benchmark"
|
msgid "Using vLLM Benchmark"
|
||||||
msgstr "使用vLLM基准测试"
|
msgstr "使用 vLLM 基准测试"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:289
|
#: ../../source/tutorials/models/Qwen3-Dense.md:289
|
||||||
msgid "Run performance evaluation of `Qwen3-32B-W8A8` as an example."
|
msgid "Run performance evaluation of `Qwen3-32B-W8A8` as an example."
|
||||||
@@ -526,7 +511,7 @@ msgstr "以运行 `Qwen3-32B-W8A8` 的性能评估为例。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
||||||
"for more details."
|
"for more details."
|
||||||
msgstr "更多详情请参阅[vllm基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
msgstr "更多详情请参阅 [vLLM 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:293
|
#: ../../source/tutorials/models/Qwen3-Dense.md:293
|
||||||
msgid "There are three `vllm bench` subcommands:"
|
msgid "There are three `vllm bench` subcommands:"
|
||||||
@@ -564,11 +549,11 @@ msgid ""
|
|||||||
"significantly improve the performance of Qwen Dense models. These "
|
"significantly improve the performance of Qwen Dense models. These "
|
||||||
"techniques are designed to enhance throughput and efficiency across "
|
"techniques are designed to enhance throughput and efficiency across "
|
||||||
"various scenarios."
|
"various scenarios."
|
||||||
msgstr "本节将介绍能显著提升Qwen Dense模型性能的关键优化点。这些技术旨在提升各种场景下的吞吐量和效率。"
|
msgstr "本节将介绍能显著提升 Qwen Dense 模型性能的关键优化点。这些技术旨在提升各种场景下的吞吐量和效率。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:316
|
#: ../../source/tutorials/models/Qwen3-Dense.md:316
|
||||||
msgid "1. Rope Optimization"
|
msgid "1. Rope Optimization"
|
||||||
msgstr "1. Rope优化"
|
msgstr "1. Rope 优化"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:318
|
#: ../../source/tutorials/models/Qwen3-Dense.md:318
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -578,7 +563,9 @@ msgid ""
|
|||||||
"performed during the first layer of the forward pass. For subsequent "
|
"performed during the first layer of the forward pass. For subsequent "
|
||||||
"layers, the position encoding is directly reused, eliminating redundant "
|
"layers, the position encoding is directly reused, eliminating redundant "
|
||||||
"calculations and significantly speeding up inference in decode phase."
|
"calculations and significantly speeding up inference in decode phase."
|
||||||
msgstr "Rope优化通过修改位置编码过程来提升模型效率。具体来说,它确保 `cos_sin_cache` 及相关索引选择操作仅在正向传播的第一层执行。对于后续层,位置编码被直接复用,消除了冗余计算,并显著加快了解码阶段的推理速度。"
|
msgstr ""
|
||||||
|
"Rope 优化通过修改位置编码过程来提升模型效率。具体来说,它确保 `cos_sin_cache` "
|
||||||
|
"及相关索引选择操作仅在正向传播的第一层执行。对于后续层,位置编码被直接复用,消除了冗余计算,并显著加快了解码阶段的推理速度。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:320
|
#: ../../source/tutorials/models/Qwen3-Dense.md:320
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:326
|
#: ../../source/tutorials/models/Qwen3-Dense.md:326
|
||||||
@@ -590,14 +577,14 @@ msgstr "此优化默认启用,无需设置任何额外的环境变量。"
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:322
|
#: ../../source/tutorials/models/Qwen3-Dense.md:322
|
||||||
msgid "2. AddRMSNormQuant Fusion"
|
msgid "2. AddRMSNormQuant Fusion"
|
||||||
msgstr "2. AddRMSNormQuant融合"
|
msgstr "2. AddRMSNormQuant 融合"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:324
|
#: ../../source/tutorials/models/Qwen3-Dense.md:324
|
||||||
msgid ""
|
msgid ""
|
||||||
"AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization "
|
"AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization "
|
||||||
"and Quantization operations, allowing for more efficient memory access "
|
"and Quantization operations, allowing for more efficient memory access "
|
||||||
"and computation, thereby enhancing throughput."
|
"and computation, thereby enhancing throughput."
|
||||||
msgstr "AddRMSNormQuant融合将地址感知多尺度归一化与量化操作合并,实现了更高效的内存访问和计算,从而提升了吞吐量。"
|
msgstr "AddRMSNormQuant 融合将地址感知多尺度归一化与量化操作合并,实现了更高效的内存访问和计算,从而提升了吞吐量。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:328
|
#: ../../source/tutorials/models/Qwen3-Dense.md:328
|
||||||
msgid "3. FlashComm_v1"
|
msgid "3. FlashComm_v1"
|
||||||
@@ -612,7 +599,9 @@ msgid ""
|
|||||||
"processing. In quantization scenarios, FlashComm_v1 also reduces the "
|
"processing. In quantization scenarios, FlashComm_v1 also reduces the "
|
||||||
"communication overhead by decreasing the bit-level data transfer, which "
|
"communication overhead by decreasing the bit-level data transfer, which "
|
||||||
"further minimizes the end-to-end latency during the prefill phase."
|
"further minimizes the end-to-end latency during the prefill phase."
|
||||||
msgstr "FlashComm_v1通过将传统的allreduce集合通信分解为reduce-scatter和all-gather,显著提升了大批量场景下的性能。这种分解有助于减少RMSNorm令牌维度的计算,从而实现更高效的处理。在量化场景中,FlashComm_v1还通过减少比特级数据传输来降低通信开销,从而进一步最小化预填充阶段的端到端延迟。"
|
msgstr ""
|
||||||
|
"FlashComm_v1 通过将传统的 allreduce 集合通信分解为 reduce-scatter 和 all-"
|
||||||
|
"gather,显著提升了大批量场景下的性能。这种分解有助于减少 RMSNorm 令牌维度的计算,从而实现更高效的处理。在量化场景中,FlashComm_v1 还通过减少比特级数据传输来降低通信开销,从而进一步最小化预填充阶段的端到端延迟。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:332
|
#: ../../source/tutorials/models/Qwen3-Dense.md:332
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -626,7 +615,9 @@ msgid ""
|
|||||||
"exceeds the threshold. This ensures that the feature is only activated in"
|
"exceeds the threshold. This ensures that the feature is only activated in"
|
||||||
" scenarios where it improves performance, avoiding potential degradation "
|
" scenarios where it improves performance, avoiding potential degradation "
|
||||||
"in lower-concurrency situations."
|
"in lower-concurrency situations."
|
||||||
msgstr "需要注意的是,将allreduce通信分解为reduce-scatter和all-gather操作仅在无显著通信降级的高并发场景下有益。在其他情况下,这种分解可能导致明显的性能下降。为缓解此问题,当前实现采用基于阈值的方法,仅当每个推理调度的实际令牌数超过阈值时才启用FlashComm_v1。这确保了该功能仅在能提升性能的场景下激活,避免了在低并发情况下可能出现的性能下降。"
|
msgstr ""
|
||||||
|
"需要注意的是,将 allreduce 通信分解为 reduce-scatter 和 all-"
|
||||||
|
"gather 操作仅在无显著通信降级的高并发场景下有益。在其他情况下,这种分解可能导致明显的性能下降。为缓解此问题,当前实现采用基于阈值的方法,仅当每个推理调度的实际令牌数超过阈值时才启用 FlashComm_v1。这确保了该功能仅在能提升性能的场景下激活,避免了在低并发情况下可能出现的性能下降。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:334
|
#: ../../source/tutorials/models/Qwen3-Dense.md:334
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -636,7 +627,7 @@ msgstr "此优化需要设置环境变量 `VLLM_ASCEND_ENABLE_FLASHCOMM1 = 1`
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:336
|
#: ../../source/tutorials/models/Qwen3-Dense.md:336
|
||||||
msgid "4. Matmul and ReduceScatter Fusion"
|
msgid "4. Matmul and ReduceScatter Fusion"
|
||||||
msgstr "4. 矩阵乘法和ReduceScatter融合"
|
msgstr "4. 矩阵乘法和 ReduceScatter 融合"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:338
|
#: ../../source/tutorials/models/Qwen3-Dense.md:338
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -648,7 +639,7 @@ msgid ""
|
|||||||
"communication steps, improves computational efficiency, and allows for "
|
"communication steps, improves computational efficiency, and allows for "
|
||||||
"better resource utilization, resulting in enhanced throughput, especially"
|
"better resource utilization, resulting in enhanced throughput, especially"
|
||||||
" in large-scale distributed environments."
|
" in large-scale distributed environments."
|
||||||
msgstr "一旦启用FlashComm_v1,可以应用额外的优化。此优化融合了矩阵乘法和ReduceScatter操作,并包含分片优化。矩阵乘法计算被视为一个流水线,而ReduceScatter和反量化操作则在另一个独立的流水线中处理。这种方法显著减少了通信步骤,提高了计算效率,并实现了更好的资源利用,从而提升了吞吐量,尤其在大规模分布式环境中效果显著。"
|
msgstr "一旦启用 FlashComm_v1,可以应用额外的优化。此优化融合了矩阵乘法和 ReduceScatter 操作,并包含分片优化。矩阵乘法计算被视为一个流水线,而 ReduceScatter 和反量化操作则在另一个独立的流水线中处理。这种方法显著减少了通信步骤,提高了计算效率,并实现了更好的资源利用,从而提升了吞吐量,尤其在大规模分布式环境中效果显著。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:340
|
#: ../../source/tutorials/models/Qwen3-Dense.md:340
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -658,7 +649,7 @@ msgid ""
|
|||||||
" is currently used to mitigate this problem. The optimization is only "
|
" is currently used to mitigate this problem. The optimization is only "
|
||||||
"applied when the token count exceeds the threshold, ensuring that it is "
|
"applied when the token count exceeds the threshold, ensuring that it is "
|
||||||
"not enabled in cases where it could negatively impact performance."
|
"not enabled in cases where it could negatively impact performance."
|
||||||
msgstr "此优化在FlashComm_v1激活后会自动启用。然而,由于融合后在小并发场景下存在性能下降的问题,目前采用基于阈值的方法来缓解此问题。该优化仅在令牌数超过阈值时应用,确保在可能对性能产生负面影响的情况下不被启用。"
|
msgstr "此优化在 FlashComm_v1 激活后会自动启用。然而,由于融合后在小并发场景下存在性能下降的问题,目前采用基于阈值的方法来缓解此问题。该优化仅在令牌数超过阈值时应用,确保在可能对性能产生负面影响的情况下不被启用。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:342
|
#: ../../source/tutorials/models/Qwen3-Dense.md:342
|
||||||
msgid "5. Weight Prefetching"
|
msgid "5. Weight Prefetching"
|
||||||
@@ -681,7 +672,7 @@ msgid ""
|
|||||||
"preloaded to L2 cache ahead of time, reducing MTE utilization during the "
|
"preloaded to L2 cache ahead of time, reducing MTE utilization during the "
|
||||||
"MLP computations and indirectly improving Cube computation efficiency by "
|
"MLP computations and indirectly improving Cube computation efficiency by "
|
||||||
"minimizing resource contention and optimizing data flow."
|
"minimizing resource contention and optimizing data flow."
|
||||||
msgstr "在稠密模型场景中,MLP的gate_up_proj和down_proj线性层通常表现出相对较高的MTE利用率。为解决此问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与MLP之前的原始向量计算流水线(如RMSNorm和SiLU)并行运行。这种方法允许权重提前预加载到L2缓存中,从而降低MLP计算期间的MTE利用率,并通过最小化资源争用和优化数据流,间接提升Cube计算效率。"
|
msgstr "在稠密模型场景中,MLP 的 gate_up_proj 和 down_proj 线性层通常表现出相对较高的 MTE 利用率。为解决此问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与 MLP 之前的原始向量计算流水线(如 RMSNorm 和 SiLU)并行运行。这种方法允许权重提前预加载到 L2 缓存中,从而降低 MLP 计算期间的 MTE 利用率,并通过最小化资源争用和优化数据流,间接提升 Cube 计算效率。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:348
|
#: ../../source/tutorials/models/Qwen3-Dense.md:348
|
||||||
#, python-brace-format
|
#, python-brace-format
|
||||||
@@ -695,11 +686,17 @@ msgid ""
|
|||||||
"\"enabled\": true, \"prefetch_ratio\": { \"mlp\": { \"gate_up\": 1.0, "
|
"\"enabled\": true, \"prefetch_ratio\": { \"mlp\": { \"gate_up\": 1.0, "
|
||||||
"\"down\": 1.0}}}. See User Guide->Feature Guide->Weight Prefetch Guide "
|
"\"down\": 1.0}}}. See User Guide->Feature Guide->Weight Prefetch Guide "
|
||||||
"for details."
|
"for details."
|
||||||
msgstr "之前用于启用MLP权重预取的环境变量 `VLLM_ASCEND_ENABLE_PREFETCH_MLP`,以及用于设置MLP gate_up_proj和down_proj权重预取大小的 `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` 和 `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` 已被弃用。请改用以下配置:`\"weight_prefetch_config\": { \"enabled\": true, \"prefetch_ratio\": { \"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}}`。详情请参阅用户指南->功能指南->权重预取指南。"
|
msgstr ""
|
||||||
|
"此前用于启用 MLP 权重预取的环境变量 `VLLM_ASCEND_ENABLE_PREFETCH_MLP`,以及用于设置 MLP "
|
||||||
|
"gate_up_proj 和 down_proj 权重预取大小的 `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` 和 "
|
||||||
|
"`VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` "
|
||||||
|
"已被弃用。请改用以下配置:`\"weight_prefetch_config\": { \"enabled\": true, "
|
||||||
|
"\"prefetch_ratio\": { \"mlp\": { \"gate_up\": 1.0, \"down\": "
|
||||||
|
"1.0}}}`。详情请参阅用户指南->功能指南->权重预取指南。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:350
|
#: ../../source/tutorials/models/Qwen3-Dense.md:350
|
||||||
msgid "6. Zerolike Elimination"
|
msgid "6. Zerolike Elimination"
|
||||||
msgstr "6. Zerolike消除"
|
msgstr "6. Zerolike 消除"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:352
|
#: ../../source/tutorials/models/Qwen3-Dense.md:352
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -731,7 +728,9 @@ msgid ""
|
|||||||
"The configuration compilation_config = { \"cudagraph_mode\": "
|
"The configuration compilation_config = { \"cudagraph_mode\": "
|
||||||
"\"FULL_DECODE_ONLY\"} is used when starting the service. This setup is "
|
"\"FULL_DECODE_ONLY\"} is used when starting the service. This setup is "
|
||||||
"necessary to enable the aclgraph's full decode-only mode."
|
"necessary to enable the aclgraph's full decode-only mode."
|
||||||
msgstr "启动服务时使用配置 `compilation_config = { \"cudagraph_mode\": \"FULL_DECODE_ONLY\"}`。此设置对于启用aclgraph的完全仅解码模式是必需的。"
|
msgstr ""
|
||||||
|
"启动服务时使用配置 `compilation_config = { \"cudagraph_mode\": "
|
||||||
|
"\"FULL_DECODE_ONLY\"}`。此设置对于启用 aclgraph 的完全仅解码模式是必需的。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:362
|
#: ../../source/tutorials/models/Qwen3-Dense.md:362
|
||||||
msgid "8. Asynchronous Scheduling"
|
msgid "8. Asynchronous Scheduling"
|
||||||
@@ -785,13 +784,11 @@ msgid ""
|
|||||||
"18MB. The reason for this is that, at this value, the vector computations"
|
"18MB. The reason for this is that, at this value, the vector computations"
|
||||||
" of RMSNorm and SiLU can effectively hide the prefetch stream, thereby "
|
" of RMSNorm and SiLU can effectively hide the prefetch stream, thereby "
|
||||||
"accelerating the Matmul computations of the two linear layers."
|
"accelerating the Matmul computations of the two linear layers."
|
||||||
msgstr ""
|
msgstr "例如,在上述实际场景中,我将 MLP 中 gate_up_proj 和 down_proj 的预取缓冲区大小设置为 18MB。这样做的原因是,在此数值下,RMSNorm 和 SiLU 的向量计算能够有效隐藏预取流,从而加速两个线性层的 Matmul 计算。"
|
||||||
"例如,在上述实际场景中,我将MLP中gate_up_proj和down_proj的预取缓冲区大小设置为18MB。"
|
|
||||||
"这样做的原因是,在此数值下,RMSNorm和SiLU的向量计算能够有效隐藏预取流,从而加速两个线性层的Matmul计算。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:378
|
#: ../../source/tutorials/models/Qwen3-Dense.md:378
|
||||||
msgid "2.Max-num-batched-tokens"
|
msgid "2.Max-num-batched-tokens"
|
||||||
msgstr "2.最大批处理令牌数"
|
msgstr "2. 最大批处理令牌数"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:380
|
#: ../../source/tutorials/models/Qwen3-Dense.md:380
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -802,24 +799,22 @@ msgid ""
|
|||||||
"processed per batch, potentially leading to inefficiencies. Conversely, "
|
"processed per batch, potentially leading to inefficiencies. Conversely, "
|
||||||
"setting it too large increases the risk of Out of Memory (OOM) errors due"
|
"setting it too large increases the risk of Out of Memory (OOM) errors due"
|
||||||
" to excessive memory consumption."
|
" to excessive memory consumption."
|
||||||
msgstr ""
|
msgstr "最大批处理令牌数参数决定了单批次可处理的令牌数量上限。调整此值有助于平衡吞吐量与内存使用。若设置过小,每批次处理的令牌数较少,可能降低效率,从而对端到端性能产生负面影响。反之,若设置过大,则会因内存消耗过高而增加内存溢出(OOM)错误的风险。"
|
||||||
"最大批处理令牌数参数决定了单批次可处理的令牌数量上限。调整此值有助于平衡吞吐量与内存使用。"
|
|
||||||
"若设置过小,每批次处理的令牌数较少,可能降低效率,从而对端到端性能产生负面影响。"
|
|
||||||
"反之,若设置过大,则会因内存消耗过高而增加内存溢出(OOM)错误的风险。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:382
|
#: ../../source/tutorials/models/Qwen3-Dense.md:382
|
||||||
msgid ""
|
msgid ""
|
||||||
"In the above real-world scenario, we not only conducted extensive testing"
|
"In the above real-world scenario, we not only conducted extensive testing"
|
||||||
" to determine the most cost-effective value, but also took into account "
|
" to determine the most cost-effective value, but also took into account "
|
||||||
"the accumulation of decode tokens when enabling chunked prefill. If the "
|
"the accumulation of decode tokens when enabling chunked prefill. If the "
|
||||||
"value is set too small, a single request may被分块多次,并且在推理的早期阶段,一个批次可能只包含少量解码令牌。这可能导致端到端吞吐量达不到预期。"
|
"value is set too small, a single request may be chunked multiple times, "
|
||||||
msgstr ""
|
"and during the early stages of inference, a batch may contain only a "
|
||||||
"在上述实际场景中,我们不仅通过大量测试确定了最具性价比的数值,还考虑了启用分块预填充时解码令牌的累积问题。"
|
"small number of decode tokens. This can result in the end-to-end "
|
||||||
"若该值设置过小,单个请求可能被多次分块处理,且在推理早期阶段,单个批次可能仅包含少量解码令牌,从而导致端到端吞吐量无法达到预期。"
|
"throughput falling short of expectations."
|
||||||
|
msgstr "在上述实际场景中,我们不仅通过大量测试确定了最具性价比的数值,还考虑了启用分块预填充时解码令牌的累积问题。若该值设置过小,单个请求可能被多次分块处理,且在推理早期阶段,单个批次可能仅包含少量解码令牌,从而导致端到端吞吐量无法达到预期。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:384
|
#: ../../source/tutorials/models/Qwen3-Dense.md:384
|
||||||
msgid "3.Cudagraph_capture_sizes"
|
msgid "3.Cudagraph_capture_sizes"
|
||||||
msgstr "3.CUDA图捕获尺寸"
|
msgstr "3. CUDA 图捕获尺寸"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:386
|
#: ../../source/tutorials/models/Qwen3-Dense.md:386
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -827,8 +822,7 @@ msgid ""
|
|||||||
"captures during the inference process. Adjusting this value determines "
|
"captures during the inference process. Adjusting this value determines "
|
||||||
"how much of the computation graph is captured at once, which can "
|
"how much of the computation graph is captured at once, which can "
|
||||||
"significantly impact both performance and memory usage."
|
"significantly impact both performance and memory usage."
|
||||||
msgstr ""
|
msgstr "CUDA 图捕获尺寸参数控制推理过程中图捕获的粒度。调整此值决定了单次捕获的计算图范围,这对性能和内存使用均有显著影响。"
|
||||||
"CUDA图捕获尺寸参数控制推理过程中图捕获的粒度。调整此值决定了单次捕获的计算图范围,这对性能和内存使用均有显著影响。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:388
|
#: ../../source/tutorials/models/Qwen3-Dense.md:388
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -839,9 +833,7 @@ msgid ""
|
|||||||
" between two sizes, the framework will automatically pad the token count "
|
" between two sizes, the framework will automatically pad the token count "
|
||||||
"to the larger size. This often leads to actual performance deviating from"
|
"to the larger size. This often leads to actual performance deviating from"
|
||||||
" the expected or even degrading."
|
" the expected or even degrading."
|
||||||
msgstr ""
|
msgstr "若未手动指定此列表,系统将自动填充一系列均匀分布的值,这通常能保证良好性能。但若需进一步微调,手动指定数值将获得更佳效果。这是因为当批次大小介于两个尺寸之间时,框架会自动将令牌数填充至较大尺寸,这常导致实际性能偏离预期甚至下降。"
|
||||||
"若未手动指定此列表,系统将自动填充一系列均匀分布的值,这通常能保证良好性能。"
|
|
||||||
"但若需进一步微调,手动指定数值将获得更佳效果。这是因为当批次大小介于两个尺寸之间时,框架会自动将令牌数填充至较大尺寸,这常导致实际性能偏离预期甚至下降。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:390
|
#: ../../source/tutorials/models/Qwen3-Dense.md:390
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -850,9 +842,7 @@ msgid ""
|
|||||||
"actually included in the cudagraph_capture_sizes list. This way, during "
|
"actually included in the cudagraph_capture_sizes list. This way, during "
|
||||||
"the decode phase, padding operations are essentially avoided, ensuring "
|
"the decode phase, padding operations are essentially avoided, ensuring "
|
||||||
"the reliability of the experimental data."
|
"the reliability of the experimental data."
|
||||||
msgstr ""
|
msgstr "因此,如上述实际场景所示,在调整基准测试请求并发度时,我们始终确保并发度实际包含在 CUDA 图捕获尺寸列表中。这样在解码阶段基本避免了填充操作,从而保证了实验数据的可靠性。"
|
||||||
"因此,如上述实际场景所示,在调整基准测试请求并发度时,我们始终确保并发度实际包含在CUDA图捕获尺寸列表中。"
|
|
||||||
"这样在解码阶段基本避免了填充操作,从而保证了实验数据的可靠性。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Dense.md:392
|
#: ../../source/tutorials/models/Qwen3-Dense.md:392
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -861,6 +851,4 @@ msgid ""
|
|||||||
"not meet this condition will be automatically filtered out. Therefore, I "
|
"not meet this condition will be automatically filtered out. Therefore, I "
|
||||||
"recommend incrementally adding concurrency based on the TP size after "
|
"recommend incrementally adding concurrency based on the TP size after "
|
||||||
"enabling FlashComm_v1."
|
"enabling FlashComm_v1."
|
||||||
msgstr ""
|
msgstr "需特别注意,若启用 FlashComm_v1,此列表中的值必须是 TP 尺寸的整数倍。不满足此条件的任何值都将被自动过滤。因此,建议在启用 FlashComm_v1 后,基于 TP 尺寸逐步增加并发度。"
|
||||||
"需特别注意,若启用FlashComm_v1,此列表中的值必须是TP尺寸的整数倍。不满足此条件的任何值都将被自动过滤。"
|
|
||||||
"因此,建议在启用FlashComm_v1后,基于TP尺寸逐步增加并发度。"
|
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -37,7 +37,9 @@ msgid ""
|
|||||||
"equipped with chain-of-thought reasoning, supporting audio, video, and "
|
"equipped with chain-of-thought reasoning, supporting audio, video, and "
|
||||||
"text input, with text output."
|
"text input, with text output."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"Qwen3-Omni 是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
"Qwen3-Omni "
|
||||||
|
"是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3"
|
||||||
|
"-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:7
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:7
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -55,14 +57,18 @@ msgid ""
|
|||||||
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
||||||
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
||||||
"model's supported feature matrix."
|
"model's supported feature matrix."
|
||||||
msgstr "请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
msgstr ""
|
||||||
|
"请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-"
|
||||||
|
"cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:13
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:13
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
||||||
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
||||||
"configuration."
|
"configuration."
|
||||||
msgstr "请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
msgstr ""
|
||||||
|
"请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-"
|
||||||
|
"cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:15
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:15
|
||||||
msgid "Environment Preparation"
|
msgid "Environment Preparation"
|
||||||
@@ -74,18 +80,20 @@ msgstr "模型权重"
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:19
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:19
|
||||||
msgid ""
|
msgid ""
|
||||||
"`Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards(64G × 2).[Download "
|
"`Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards (64G × 2).[Download "
|
||||||
"model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
"model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||||
"Thinking) It is recommended to download the model weight to the shared "
|
"Thinking) It is recommended to download the model weight to the shared "
|
||||||
"directory of multiple nodes, such as `/root/.cache/`"
|
"directory of multiple nodes, such as `/root/.cache/`"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × 2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × "
|
||||||
|
"2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||||
|
"Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:22
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:22
|
||||||
msgid "Installation"
|
msgid "Installation"
|
||||||
msgstr "安装"
|
msgstr "安装"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:24
|
||||||
msgid "Use docker image"
|
msgid "Use docker image"
|
||||||
msgstr "使用 Docker 镜像"
|
msgstr "使用 Docker 镜像"
|
||||||
|
|
||||||
@@ -100,9 +108,11 @@ msgid ""
|
|||||||
"Select an image based on your machine type and start the docker image on "
|
"Select an image based on your machine type and start the docker image on "
|
||||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||||
"docker)."
|
"docker)."
|
||||||
msgstr "根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-up-using-docker)。"
|
msgstr ""
|
||||||
|
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-"
|
||||||
|
"up-using-docker)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:32
|
||||||
msgid "Build from source"
|
msgid "Build from source"
|
||||||
msgstr "从源码构建"
|
msgstr "从源码构建"
|
||||||
|
|
||||||
@@ -114,7 +124,9 @@ msgstr "您可以从源码构建所有组件。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"Install `vllm-ascend`, refer to [set up using "
|
"Install `vllm-ascend`, refer to [set up using "
|
||||||
"python](../../installation.md#set-up-using-python)."
|
"python](../../installation.md#set-up-using-python)."
|
||||||
msgstr "安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-python)。"
|
msgstr ""
|
||||||
|
"安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-"
|
||||||
|
"python)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:71
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:71
|
||||||
msgid "Please install system dependencies"
|
msgid "Please install system dependencies"
|
||||||
@@ -146,7 +158,9 @@ msgid ""
|
|||||||
"Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at"
|
"Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at"
|
||||||
" least 1, and for 32 GB of memory, tensor-parallel-size should be at "
|
" least 1, and for 32 GB of memory, tensor-parallel-size should be at "
|
||||||
"least 2."
|
"least 2."
|
||||||
msgstr "运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"
|
msgstr ""
|
||||||
|
"运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-"
|
||||||
|
"size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:188
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:188
|
||||||
msgid "Functional Verification"
|
msgid "Functional Verification"
|
||||||
@@ -173,25 +187,31 @@ msgid ""
|
|||||||
"As an example, take the `gsm8k` `omnibench` `bbh` dataset as a test "
|
"As an example, take the `gsm8k` `omnibench` `bbh` dataset as a test "
|
||||||
"dataset, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in "
|
"dataset, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in "
|
||||||
"online mode."
|
"online mode."
|
||||||
msgstr "以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-Thinking` 的精度评估。"
|
msgstr ""
|
||||||
|
"以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-"
|
||||||
|
"Thinking` 的精度评估。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:239
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:239
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to Using "
|
"Refer to Using "
|
||||||
"evalscope(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
|
"evalscope(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
|
||||||
"#install-evalscope-using-pip>) for `evalscope`installation."
|
"#install-evalscope-using-pip>) for `evalscope`installation."
|
||||||
msgstr "关于 `evalscope` 的安装,请参考使用 evalscope (<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip>)。"
|
msgstr ""
|
||||||
|
"关于 `evalscope` 的安装,请参考使用 evalscope "
|
||||||
|
"(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
|
||||||
|
"#install-evalscope-using-pip>)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:240
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:240
|
||||||
msgid "Run `evalscope` to execute the accuracy evaluation."
|
msgid "Run `evalscope` to execute the accuracy evaluation."
|
||||||
msgstr "运行 `evalscope` 以执行精度评估。"
|
msgstr "运行 `evalscope` 以执行精度评估。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:255
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:255
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
|
|
||||||
msgid ""
|
msgid ""
|
||||||
"After execution, you can get the result, here is the result of `Qwen3"
|
"After execution, you can get the result, here is the result of `Qwen3"
|
||||||
"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
|
"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
|
||||||
msgstr "执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 中的结果,仅供参考。"
|
msgstr ""
|
||||||
|
"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
|
||||||
|
"中的结果,仅供参考。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:269
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:269
|
||||||
msgid "Performance"
|
msgid "Performance"
|
||||||
@@ -207,7 +227,9 @@ msgid ""
|
|||||||
"example. Refer to vllm benchmark for more details. Refer to [vllm "
|
"example. Refer to vllm benchmark for more details. Refer to [vllm "
|
||||||
"benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more "
|
"benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more "
|
||||||
"details."
|
"details."
|
||||||
msgstr "以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 vllm 基准测试。更多详情请参考 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
msgstr ""
|
||||||
|
"以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 vllm 基准测试。更多详情请参考 [vllm"
|
||||||
|
" 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:277
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:277
|
||||||
msgid "There are three `vllm bench` subcommands:"
|
msgid "There are three `vllm bench` subcommands:"
|
||||||
@@ -228,3 +250,11 @@ msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
|||||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:283
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:283
|
||||||
msgid "Take the `serve` as an example. Run the code as follows."
|
msgid "Take the `serve` as an example. Run the code as follows."
|
||||||
msgstr "以 `serve` 为例。按如下方式运行代码。"
|
msgstr "以 `serve` 为例。按如下方式运行代码。"
|
||||||
|
|
||||||
|
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
|
||||||
|
msgid ""
|
||||||
|
"After execution, you can get the result, here is the result of `Qwen3"
|
||||||
|
"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
|
||||||
|
msgstr ""
|
||||||
|
"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
|
||||||
|
"中的结果,仅供参考。"
|
||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -79,7 +79,10 @@ msgid ""
|
|||||||
"`Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) "
|
"`Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) "
|
||||||
"nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model "
|
"nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model "
|
||||||
"weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)"
|
"weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)"
|
||||||
msgstr "`Qwen3.5-397B-A17B` (BF16 版本):需要 2 个 Atlas 800 A3 (64G × 16) 节点或 4 个 Atlas 800 A2 (64G × 8) 节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)"
|
msgstr ""
|
||||||
|
"`Qwen3.5-397B-A17B` (BF16 版本):需要 2 个 Atlas 800 A3 (64G × 16) 节点或 4 个 "
|
||||||
|
"Atlas 800 A2 (64G × 8) "
|
||||||
|
"节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:22
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:22
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -87,7 +90,10 @@ msgid ""
|
|||||||
"× 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model "
|
"× 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model "
|
||||||
"weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-"
|
"weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-"
|
||||||
"w8a8-mtp)"
|
"w8a8-mtp)"
|
||||||
msgstr "`Qwen3.5-397B-A17B-w8a8` (量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点或 2 个 Atlas 800 A2 (64G × 8) 节点。[下载模型权重](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)"
|
msgstr ""
|
||||||
|
"`Qwen3.5-397B-A17B-w8a8` (量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点或 2 个 "
|
||||||
|
"Atlas 800 A2 (64G × 8) 节点。[下载模型权重](https://www.modelscope.cn/models/Eco-"
|
||||||
|
"Tech/Qwen3.5-397B-A17B-w8a8-mtp)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:24
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:24
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -104,13 +110,15 @@ msgid ""
|
|||||||
"If you want to deploy multi-node environment, you need to verify multi-"
|
"If you want to deploy multi-node environment, you need to verify multi-"
|
||||||
"node communication according to [verify multi-node communication "
|
"node communication according to [verify multi-node communication "
|
||||||
"environment](../../installation.md#verify-multi-node-communication)."
|
"environment](../../installation.md#verify-multi-node-communication)."
|
||||||
msgstr "如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
|
msgstr ""
|
||||||
|
"如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||||
|
"communication)来验证多节点通信。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:30
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:30
|
||||||
msgid "Installation"
|
msgid "Installation"
|
||||||
msgstr "安装"
|
msgstr "安装"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:34
|
||||||
msgid "Use docker image"
|
msgid "Use docker image"
|
||||||
msgstr "使用 Docker 镜像"
|
msgstr "使用 Docker 镜像"
|
||||||
|
|
||||||
@@ -119,16 +127,20 @@ msgid ""
|
|||||||
"For example, using images `quay.io/ascend/vllm-ascend:v0.17.0rc1`(for "
|
"For example, using images `quay.io/ascend/vllm-ascend:v0.17.0rc1`(for "
|
||||||
"Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.17.0rc1-a3`(for Atlas "
|
"Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.17.0rc1-a3`(for Atlas "
|
||||||
"800 A3)."
|
"800 A3)."
|
||||||
msgstr "例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.17.0rc1`(适用于 Atlas 800 A2)和 `quay.io/ascend/vllm-ascend:v0.17.0rc1-a3`(适用于 Atlas 800 A3)。"
|
msgstr ""
|
||||||
|
"例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.17.0rc1`(适用于 Atlas 800 A2)和 "
|
||||||
|
"`quay.io/ascend/vllm-ascend:v0.17.0rc1-a3`(适用于 Atlas 800 A3)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:38
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:38
|
||||||
msgid ""
|
msgid ""
|
||||||
"Select an image based on your machine type and start the docker image on "
|
"Select an image based on your machine type and start the docker image on "
|
||||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||||
"docker)."
|
"docker)."
|
||||||
msgstr "根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-up-using-docker)。"
|
msgstr ""
|
||||||
|
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-"
|
||||||
|
"up-using-docker)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:76
|
||||||
msgid "Build from source"
|
msgid "Build from source"
|
||||||
msgstr "从源码构建"
|
msgstr "从源码构建"
|
||||||
|
|
||||||
@@ -140,7 +152,9 @@ msgstr "您可以从源码构建所有组件。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"Install `vllm-ascend`, refer to [set up using "
|
"Install `vllm-ascend`, refer to [set up using "
|
||||||
"python](../../installation.md#set-up-using-python)."
|
"python](../../installation.md#set-up-using-python)."
|
||||||
msgstr "安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-python)。"
|
msgstr ""
|
||||||
|
"安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-"
|
||||||
|
"python)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:84
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:84
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -158,39 +172,42 @@ msgstr "单节点部署"
|
|||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:90
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:90
|
||||||
msgid ""
|
msgid ""
|
||||||
"`Qwen3.5-397B-A17B` can be deployed on 2 Atlas 800 A3(64G*16) or 4 Atlas "
|
"`Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 A3(64G*16) or 2 "
|
||||||
"800 A2(64G*8). `Qwen3.5-397B-A17B-w8a8` can be deployed on 1 Atlas 800 "
|
"Atlas 800 A2(64G*8), need to start with parameter `--quantization "
|
||||||
"A3(64G*16) or 2 Atlas 800 A2(64G*8), need to start with parameter "
|
"ascend`."
|
||||||
"`--quantization ascend`."
|
msgstr ""
|
||||||
msgstr "`Qwen3.5-397B-A17B` 可以部署在 2 个 Atlas 800 A3(64G*16) 或 4 个 Atlas 800 A2(64G*8) 上。`Qwen3.5-397B-A17B-w8a8` 可以部署在 1 个 Atlas 800 A3(64G*16) 或 2 个 Atlas 800 A2(64G*8) 上,需要使用参数 `--quantization ascend` 启动。"
|
"`Qwen3.5-397B-A17B-w8a8` 可以部署在 1 个 Atlas 800 A3(64G*16) 或 2 个 Atlas 800 "
|
||||||
|
"A2(64G*8) 上,需要使用参数 `--quantization ascend` 启动。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:93
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:92
|
||||||
msgid ""
|
msgid ""
|
||||||
"Run the following script to execute online 128k inference On 1 Atlas 800 "
|
"Run the following script to execute online 128k inference On 1 Atlas 800 "
|
||||||
"A3(64G*16)."
|
"A3(64G*16)."
|
||||||
msgstr "在 1 个 Atlas 800 A3(64G*16) 上运行以下脚本以执行在线 128k 推理。"
|
msgstr "在 1 个 Atlas 800 A3(64G*16) 上运行以下脚本以执行在线 128k 推理。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:134
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:133
|
||||||
msgid "**Notice:**"
|
msgid "**Notice:**"
|
||||||
msgstr "**注意:**"
|
msgstr "**注意:**"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:136
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:135
|
||||||
msgid "The parameters are explained as follows:"
|
msgid "The parameters are explained as follows:"
|
||||||
msgstr "参数解释如下:"
|
msgstr "参数解释如下:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:137
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--data-parallel-size` 1 and `--tensor-parallel-size` 16 are common "
|
"`--data-parallel-size` 1 and `--tensor-parallel-size` 16 are common "
|
||||||
"settings for data parallelism (DP) and tensor parallelism (TP) sizes."
|
"settings for data parallelism (DP) and tensor parallelism (TP) sizes."
|
||||||
msgstr "`--data-parallel-size` 1 和 `--tensor-parallel-size` 16 是数据并行 (DP) 和张量并行 (TP) 大小的常见设置。"
|
msgstr ""
|
||||||
|
"`--data-parallel-size` 1 和 `--tensor-parallel-size` 16 是数据并行 (DP) 和张量并行 "
|
||||||
|
"(TP) 大小的常见设置。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--max-model-len` represents the context length, which is the maximum "
|
"`--max-model-len` represents the context length, which is the maximum "
|
||||||
"value of the input plus output for a single request."
|
"value of the input plus output for a single request."
|
||||||
msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"
|
msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--max-num-seqs` indicates the maximum number of requests that each DP "
|
"`--max-num-seqs` indicates the maximum number of requests that each DP "
|
||||||
"group is allowed to process. If the number of requests sent to the "
|
"group is allowed to process. If the number of requests sent to the "
|
||||||
@@ -199,36 +216,44 @@ msgid ""
|
|||||||
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
"state is also counted in metrics such as TTFT and TPOT. Therefore, when "
|
||||||
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
||||||
"`--data-parallel-size` >= the actual total concurrency."
|
"`--data-parallel-size` >= the actual total concurrency."
|
||||||
msgstr "`--max-num-seqs` 表示每个 DP 组允许处理的最大请求数。如果发送到服务的请求数超过此限制,多余的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= 实际总并发数。"
|
msgstr ""
|
||||||
|
"`--max-num-seqs` 表示每个 DP "
|
||||||
|
"组允许处理的最大请求数。如果发送到服务的请求数超过此限制,多余的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT"
|
||||||
|
" 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= "
|
||||||
|
"实际总并发数。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
||||||
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
||||||
"enables ChunkPrefill/SplitFuse by default, which means:"
|
"enables ChunkPrefill/SplitFuse by default, which means:"
|
||||||
msgstr "`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 ChunkPrefill/SplitFuse,这意味着:"
|
msgstr ""
|
||||||
|
"`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
|
||||||
|
"ChunkPrefill/SplitFuse,这意味着:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
|
||||||
msgid ""
|
msgid ""
|
||||||
"(1) If the input length of a request is greater than `--max-num-batched-"
|
"(1) If the input length of a request is greater than `--max-num-batched-"
|
||||||
"tokens`, it will be divided into multiple rounds of computation according"
|
"tokens`, it will be divided into multiple rounds of computation according"
|
||||||
" to `--max-num-batched-tokens`;"
|
" to `--max-num-batched-tokens`;"
|
||||||
msgstr "(1) 如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens` 被分成多轮计算;"
|
msgstr ""
|
||||||
|
"(1) 如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-"
|
||||||
|
"tokens` 被分成多轮计算;"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
|
||||||
msgid ""
|
msgid ""
|
||||||
"(2) Decode requests are prioritized for scheduling, and prefill requests "
|
"(2) Decode requests are prioritized for scheduling, and prefill requests "
|
||||||
"are scheduled only if there is available capacity."
|
"are scheduled only if there is available capacity."
|
||||||
msgstr "(2) 解码请求优先调度,只有在有可用容量时才调度预填充请求。"
|
msgstr "(2) 解码请求优先调度,只有在有可用容量时才调度预填充请求。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
|
||||||
msgid ""
|
msgid ""
|
||||||
"Generally, if `--max-num-batched-tokens` is set to a larger value, the "
|
"Generally, if `--max-num-batched-tokens` is set to a larger value, the "
|
||||||
"overall latency will be lower, but the pressure on GPU memory (activation"
|
"overall latency will be lower, but the pressure on GPU memory (activation"
|
||||||
" value usage) will be greater."
|
" value usage) will be greater."
|
||||||
msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"
|
msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
|
"`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
|
||||||
"will use for actual inference. Its essential function is to calculate the"
|
"will use for actual inference. Its essential function is to calculate the"
|
||||||
@@ -242,16 +267,24 @@ msgid ""
|
|||||||
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
"during actual inference (e.g., due to uneven EP load), setting `--gpu-"
|
||||||
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
||||||
"during actual inference. The default value is `0.9`."
|
"during actual inference. The default value is `0.9`."
|
||||||
msgstr "`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache 大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache 就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
msgstr ""
|
||||||
|
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
|
||||||
|
"大小。在预热阶段(vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` "
|
||||||
|
"的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
|
||||||
|
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可用的 kv_cache "
|
||||||
|
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
|
||||||
|
"utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
||||||
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
||||||
"use pure EP or pure TP."
|
"use pure EP or pure TP."
|
||||||
msgstr "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE 要么使用纯 EP,要么使用纯 TP。"
|
msgstr ""
|
||||||
|
"`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
|
||||||
|
"要么使用纯 EP,要么使用纯 TP。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
|
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
|
||||||
"To enable it, for mamba-like models Qwen3.5, set `--enable-prefix-"
|
"To enable it, for mamba-like models Qwen3.5, set `--enable-prefix-"
|
||||||
@@ -259,15 +292,19 @@ msgid ""
|
|||||||
"implementation of hybrid kv cache might result in a very large block_size"
|
"implementation of hybrid kv cache might result in a very large block_size"
|
||||||
" when scheduling. For example, the block_size may be adjusted to 2048, "
|
" when scheduling. For example, the block_size may be adjusted to 2048, "
|
||||||
"which means that any prefix shorter than 2048 will never be cached."
|
"which means that any prefix shorter than 2048 will never be cached."
|
||||||
msgstr "`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,对于类似 Mamba 的模型 Qwen3.5,请设置 `--enable-prefix-caching` 和 `--mamba-cache-mode align`。请注意,当前混合 kv cache 的实现可能在调度时导致非常大的 block_size。例如,block_size 可能被调整为 2048,这意味着任何短于 2048 的前缀将永远不会被缓存。"
|
msgstr ""
|
||||||
|
"`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,对于类似 Mamba 的模型 Qwen3.5,请设置 "
|
||||||
|
"`--enable-prefix-caching` 和 `--mamba-cache-mode align`。请注意,当前混合 kv cache "
|
||||||
|
"的实现可能在调度时导致非常大的 block_size。例如,block_size 可能被调整为 2048,这意味着任何短于 2048 "
|
||||||
|
"的前缀将永远不会被缓存。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:148
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--quantization` \"ascend\" indicates that quantization is used. To "
|
"`--quantization` \"ascend\" indicates that quantization is used. To "
|
||||||
"disable quantization, remove this option."
|
"disable quantization, remove this option."
|
||||||
msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"
|
msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:149
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:148
|
||||||
msgid ""
|
msgid ""
|
||||||
"`--compilation-config` contains configurations related to the aclgraph "
|
"`--compilation-config` contains configurations related to the aclgraph "
|
||||||
"graph mode. The most significant configurations are \"cudagraph_mode\" "
|
"graph mode. The most significant configurations are \"cudagraph_mode\" "
|
||||||
@@ -276,9 +313,13 @@ msgid ""
|
|||||||
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
"\"PIECEWISE\" and \"FULL_DECODE_ONLY\" are supported. The graph mode is "
|
||||||
"mainly used to reduce the cost of operator dispatch. Currently, "
|
"mainly used to reduce the cost of operator dispatch. Currently, "
|
||||||
"\"FULL_DECODE_ONLY\" is recommended."
|
"\"FULL_DECODE_ONLY\" is recommended."
|
||||||
msgstr "`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和 \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 \"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 \"FULL_DECODE_ONLY\"。"
|
msgstr ""
|
||||||
|
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
|
||||||
|
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 "
|
||||||
|
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
|
||||||
|
"\"FULL_DECODE_ONLY\"。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:151
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:150
|
||||||
msgid ""
|
msgid ""
|
||||||
"\"cudagraph_capture_sizes\": represents different levels of graph modes. "
|
"\"cudagraph_capture_sizes\": represents different levels of graph modes. "
|
||||||
"The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
|
"The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
|
||||||
@@ -286,164 +327,124 @@ msgid ""
|
|||||||
" inputs between levels are automatically padded to the next level. "
|
" inputs between levels are automatically padded to the next level. "
|
||||||
"Currently, the default setting is recommended. Only in some scenarios is "
|
"Currently, the default setting is recommended. Only in some scenarios is "
|
||||||
"it necessary to set this separately to achieve optimal performance."
|
"it necessary to set this separately to achieve optimal performance."
|
||||||
msgstr "\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"
|
msgstr ""
|
||||||
|
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
|
||||||
|
"40,..., `--max-num-"
|
||||||
|
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:153
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:152
|
||||||
msgid "Multi-node Deployment with MP (Recommended)"
|
msgid "Multi-node Deployment with MP (Recommended)"
|
||||||
msgstr "使用 MP 的多节点部署(推荐)"
|
msgstr "使用 MP 的多节点部署(推荐)"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:155
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:154
|
||||||
msgid ""
|
msgid ""
|
||||||
"Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5"
|
"Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5"
|
||||||
"-397B-A17B` model across multiple nodes."
|
"-397B-A17B-w8a8-mtp` model across multiple nodes."
|
||||||
msgstr "假设您有 2 个 Atlas 800 A2 节点,并希望跨多个节点部署 `Qwen3.5-397B-A17B` 模型。"
|
msgstr "假设您有 2 个 Atlas 800 A2 节点,并希望跨多个节点部署 `Qwen3.5-397B-A17B-w8a8-mtp` 模型。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:157
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:156
|
||||||
msgid "Node 0"
|
msgid "Node 0"
|
||||||
msgstr "节点 0"
|
msgstr "节点 0"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:203
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:202
|
||||||
msgid "Node1"
|
msgid "Node1"
|
||||||
msgstr "节点 1"
|
msgstr "节点 1"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:253
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:252
|
||||||
msgid ""
|
msgid ""
|
||||||
"If the service starts successfully, the following information will be "
|
"If the service starts successfully, the following information will be "
|
||||||
"displayed on node 0:"
|
"displayed on node 0:"
|
||||||
msgstr "如果服务启动成功,节点 0 上将显示以下信息:"
|
msgstr "如果服务启动成功,节点 0 上将显示以下信息:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:264
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:263
|
||||||
msgid "Multi-node Deployment with Ray"
|
msgid "Multi-node Deployment with Ray"
|
||||||
msgstr "使用 Ray 的多节点部署"
|
msgstr "使用 Ray 的多节点部署"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:266
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:265
|
||||||
msgid "refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md)."
|
msgid "refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md)."
|
||||||
msgstr "请参考 [Ray 分布式 (Qwen/Qwen3-235B-A22B)](../features/ray.md)。"
|
msgstr "请参考 [Ray 分布式 (Qwen/Qwen3-235B-A22B)](../features/ray.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:268
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:267
|
||||||
msgid "Prefill-Decode Disaggregation"
|
msgid "Prefill-Decode Disaggregation"
|
||||||
msgstr "预填充-解码解耦"
|
msgstr "预填充-解码解耦"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:270
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:269
|
||||||
msgid ""
|
msgid ""
|
||||||
"We recommend using Mooncake for deployment: "
|
"We recommend using Mooncake for deployment: "
|
||||||
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
|
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
|
||||||
msgstr "我们推荐使用 Mooncake 进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
|
msgstr "我们推荐使用 Mooncake 进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:272
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:271
|
||||||
msgid ""
|
msgid ""
|
||||||
"Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 1P1D (3 "
|
"Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 1P1D (3 "
|
||||||
"nodes) to run Qwen3.5-397B-A17B."
|
"nodes) to run Qwen3.5-397B-A17B."
|
||||||
msgstr "以 Atlas 800 A3 (64G × 16) 为例,我们建议部署 1P1D(3 个节点)来运行 Qwen3.5-397B-A17B。"
|
msgstr "以 Atlas 800 A3 (64G × 16) 为例,我们建议部署 1P1D(3 个节点)来运行 Qwen3.5-397B-A17B。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:274
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:273
|
||||||
msgid "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` require 3 Atlas 800 A3 (64G × 16)."
|
msgid "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` require 3 Atlas 800 A3 (64G × 16)."
|
||||||
msgstr "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` 需要 3 个 Atlas 800 A3 (64G × 16)。"
|
msgstr "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` 需要 3 个 Atlas 800 A3 (64G × 16)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:276
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:275
|
||||||
msgid ""
|
msgid ""
|
||||||
"To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
|
"To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
|
||||||
"to deploy `run_p.sh` 、`run_d0.sh` and `run_d1.sh` script on each node and"
|
"to deploy `run_p.sh` 、`run_d0.sh` and `run_d1.sh` script on each node and"
|
||||||
" deploy a `proxy.sh` script on prefill master node to forward requests."
|
" deploy a `proxy.sh` script on prefill master node to forward requests."
|
||||||
msgstr "要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 `run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"
|
msgstr "要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 `run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:278
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:277
|
||||||
msgid "Prefill Node 0 `run_p.sh` script"
|
msgid "Prefill Node 0 `run_p.sh` script"
|
||||||
msgstr "预填充节点 0 `run_p.sh` 脚本"
|
msgstr "预填充节点 0 `run_p.sh` 脚本"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:353
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:352
|
||||||
msgid "Decode Node 0 `run_d0.sh` script"
|
msgid "Decode Node 0 `run_d0.sh` script"
|
||||||
msgstr "解码节点 0 `run_d0.sh` 脚本"
|
msgstr "解码节点 0 `run_d0.sh` 脚本"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:433
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:432
|
||||||
msgid "Decode Node 1 `run_d1.sh` script"
|
msgid "Decode Node 1 `run_d1.sh` script"
|
||||||
msgstr "解码节点 1 `run_d1.sh` 脚本"
|
msgstr "解码节点 1 `run_d1.sh` 脚本"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:512
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:519
|
||||||
msgid "**Notice:** The parameters are explained as follows:"
|
|
||||||
msgstr "**注意:** 参数说明如下:"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:515
|
|
||||||
msgid ""
|
|
||||||
"`--async-scheduling`: enables the asynchronous scheduling function. When "
|
|
||||||
"Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of "
|
|
||||||
"operator delivery can be implemented to overlap the operator delivery "
|
|
||||||
"latency."
|
|
||||||
msgstr ""
|
|
||||||
"`--async-scheduling`:启用异步调度功能。当启用多令牌预测(MTP)时,可以实现算子交付的异步调度,以重叠算子交付延迟。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:516
|
|
||||||
msgid ""
|
|
||||||
"`cudagraph_capture_sizes`: The recommended value is `n x (mtp + 1)`. And "
|
|
||||||
"the min is `n = 1` and the max is `n = max-num-seqs`. For other values, "
|
|
||||||
"it is recommended to set them to the number of frequently occurring "
|
|
||||||
"requests on the Decode (D) node."
|
|
||||||
msgstr ""
|
|
||||||
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大值为 `n = max-num-seqs`。对于其他值,建议设置为解码(D)节点上频繁出现的请求数量。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:517
|
|
||||||
msgid ""
|
|
||||||
"`recompute_scheduler_enable: true`: enables the recomputation scheduler. "
|
|
||||||
"When the Key-Value Cache (KV Cache) of the decode node is insufficient, "
|
|
||||||
"requests will be sent to the prefill node to recompute the KV Cache. In "
|
|
||||||
"the PD separation scenario, it is recommended to enable this "
|
|
||||||
"configuration on both prefill and decode nodes simultaneously."
|
|
||||||
msgstr ""
|
|
||||||
"`recompute_scheduler_enable: true`:启用重计算调度器。当解码节点的键值缓存(KV Cache)不足时,请求将被发送到预填充节点以重新计算 KV Cache。在 PD 分离场景下,建议同时在预填充节点和解码节点上启用此配置。"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:518
|
|
||||||
msgid ""
|
|
||||||
"`no-enable-prefix-caching`: The prefix-cache feature is enabled by "
|
|
||||||
"default. You can use the `--no-enable-prefix-caching` parameter to "
|
|
||||||
"disable this feature. Notice: for Prefill-Decode disaggregation feature, "
|
|
||||||
"known issue on D node: [#7944](https://github.com/vllm-project/vllm-"
|
|
||||||
"ascend/issues/7944)"
|
|
||||||
msgstr ""
|
|
||||||
"`no-enable-prefix-caching`:前缀缓存功能默认启用。您可以使用 `--no-enable-prefix-caching` 参数禁用此功能。注意:对于预填充-解码分离功能,D 节点上的已知问题:[#7944](https://github.com/vllm-project/vllm-ascend/issues/7944)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:520
|
|
||||||
msgid "Run the `proxy.sh` script on the prefill master node"
|
msgid "Run the `proxy.sh` script on the prefill master node"
|
||||||
msgstr "在预填充主节点上运行 `proxy.sh` 脚本"
|
msgstr "在预填充主节点上运行 `proxy.sh` 脚本"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:522
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:521
|
||||||
msgid ""
|
msgid ""
|
||||||
"Run a proxy server on the same node with the prefiller service instance. "
|
"Run a proxy server on the same node with the prefiller service instance. "
|
||||||
"You can get the proxy program in the repository's examples: "
|
"You can get the proxy program in the repository's examples: "
|
||||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
||||||
"project/vllm-"
|
"project/vllm-"
|
||||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr ""
|
msgstr "在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:548
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:547
|
||||||
msgid "Functional Verification"
|
msgid "Functional Verification"
|
||||||
msgstr "功能验证"
|
msgstr "功能验证"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:550
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:549
|
||||||
msgid "Once your server is started, you can query the model with input prompts:"
|
msgid "Once your server is started, you can query the model with input prompts:"
|
||||||
msgstr "服务器启动后,您可以使用输入提示词查询模型:"
|
msgstr "服务器启动后,您可以使用输入提示词查询模型:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:563
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:562
|
||||||
msgid "Accuracy Evaluation"
|
msgid "Accuracy Evaluation"
|
||||||
msgstr "精度评估"
|
msgstr "精度评估"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:565
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:564
|
||||||
msgid "Here are two accuracy evaluation methods."
|
msgid "Here are two accuracy evaluation methods."
|
||||||
msgstr "以下是两种精度评估方法。"
|
msgstr "以下是两种精度评估方法。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:567
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:566
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:579
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:578
|
||||||
msgid "Using AISBench"
|
msgid "Using AISBench"
|
||||||
msgstr "使用 AISBench"
|
msgstr "使用 AISBench"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:569
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:568
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using "
|
"Refer to [Using "
|
||||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||||
"details."
|
"details."
|
||||||
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:571
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:570
|
||||||
msgid ""
|
msgid ""
|
||||||
"After execution, you can get the result, here is the result of `Qwen3.5"
|
"After execution, you can get the result, here is the result of `Qwen3.5"
|
||||||
"-397B-A17B-w8a8` in `vllm-ascend:v0.17.0rc1` for reference only."
|
"-397B-A17B-w8a8` in `vllm-ascend:v0.17.0rc1` for reference only."
|
||||||
@@ -489,53 +490,53 @@ msgstr "生成"
|
|||||||
msgid "96.74"
|
msgid "96.74"
|
||||||
msgstr "96.74"
|
msgstr "96.74"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:577
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:576
|
||||||
msgid "Performance"
|
msgid "Performance"
|
||||||
msgstr "性能"
|
msgstr "性能"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:581
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:580
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [Using AISBench for performance "
|
"Refer to [Using AISBench for performance "
|
||||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||||
"performance-evaluation) for details."
|
"performance-evaluation) for details."
|
||||||
msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:583
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:582
|
||||||
msgid "Using vLLM Benchmark"
|
msgid "Using vLLM Benchmark"
|
||||||
msgstr "使用 vLLM Benchmark"
|
msgstr "使用 vLLM Benchmark"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:585
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:584
|
||||||
msgid "Run performance evaluation of `Qwen3.5-397B-A17B-w8a8` as an example."
|
msgid "Run performance evaluation of `Qwen3.5-397B-A17B-w8a8` as an example."
|
||||||
msgstr "以运行 `Qwen3.5-397B-A17B-w8a8` 的性能评估为例。"
|
msgstr "以运行 `Qwen3.5-397B-A17B-w8a8` 的性能评估为例。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:587
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:586
|
||||||
msgid ""
|
msgid ""
|
||||||
"Refer to [vllm "
|
"Refer to [vllm "
|
||||||
"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
|
"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
|
||||||
"for more details."
|
"for more details."
|
||||||
msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"
|
msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:589
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:588
|
||||||
msgid "There are three `vllm bench` subcommands:"
|
msgid "There are three `vllm bench` subcommands:"
|
||||||
msgstr "`vllm bench` 有三个子命令:"
|
msgstr "`vllm bench` 有三个子命令:"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:591
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:590
|
||||||
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
||||||
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
msgstr "`latency`:对单批请求的延迟进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:591
|
||||||
msgid "`serve`: Benchmark the online serving throughput."
|
msgid "`serve`: Benchmark the online serving throughput."
|
||||||
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
msgstr "`serve`:对在线服务吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:593
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
|
||||||
msgid "`throughput`: Benchmark offline inference throughput."
|
msgid "`throughput`: Benchmark offline inference throughput."
|
||||||
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:595
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:594
|
||||||
msgid "Take the `serve` as an example. Run the code as follows."
|
msgid "Take the `serve` as an example. Run the code as follows."
|
||||||
msgstr "以 `serve` 为例。运行代码如下。"
|
msgstr "以 `serve` 为例。运行代码如下。"
|
||||||
|
|
||||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:602
|
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:601
|
||||||
msgid ""
|
msgid ""
|
||||||
"After about several minutes, you can get the performance evaluation "
|
"After about several minutes, you can get the performance evaluation "
|
||||||
"result."
|
"result."
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -20,8 +20,8 @@ msgstr ""
|
|||||||
"Generated-By: Babel 2.18.0\n"
|
"Generated-By: Babel 2.18.0\n"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:1
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:1
|
||||||
msgid "Fine-Grained Tensor Parallelism (Finegrained TP)"
|
msgid "Fine-Grained Tensor Parallelism (Fine-grained TP)"
|
||||||
msgstr "细粒度张量并行 (Finegrained TP)"
|
msgstr "细粒度张量并行 (Fine-grained TP)"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:3
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:3
|
||||||
msgid "Overview"
|
msgid "Overview"
|
||||||
@@ -37,7 +37,10 @@ msgid ""
|
|||||||
"model head (lm_head), attention output projection (o_proj), and MLP "
|
"model head (lm_head), attention output projection (o_proj), and MLP "
|
||||||
"blocks—via the `finegrained_tp_config` parameter."
|
"blocks—via the `finegrained_tp_config` parameter."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"细粒度张量并行 (Fine-grained TP) 扩展了标准张量并行,允许为**不同的模型组件设置独立的张量并行规模**。与对所有层应用单一的全局 `tensor_parallel_size` 不同,细粒度 TP 允许用户通过 `finegrained_tp_config` 参数为关键模块(如嵌入层、语言模型头部 (lm_head)、注意力输出投影层 (o_proj) 和 MLP 块)配置独立的 TP 规模。"
|
"细粒度张量并行 (Fine-grained TP) "
|
||||||
|
"扩展了标准张量并行,允许为**不同的模型组件设置独立的张量并行规模**。与对所有层应用单一的全局 `tensor_parallel_size` "
|
||||||
|
"不同,细粒度 TP 允许用户通过 `finegrained_tp_config` 参数为关键模块(如嵌入层、语言模型头部 "
|
||||||
|
"(lm_head)、注意力输出投影层 (o_proj) 和 MLP 块)配置独立的 TP 规模。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:7
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:7
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -47,10 +50,11 @@ msgid ""
|
|||||||
"compatible with standard dense transformer architectures and integrates "
|
"compatible with standard dense transformer architectures and integrates "
|
||||||
"seamlessly into vLLM’s serving pipeline."
|
"seamlessly into vLLM’s serving pipeline."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"此功能支持在单个模型内使用异构并行策略,从而能更精细地控制跨设备的权重分布、内存布局和通信模式。该特性与标准的密集 Transformer 架构兼容,并能无缝集成到 vLLM 的服务流水线中。"
|
"此功能支持在单个模型内使用异构并行策略,从而能更精细地控制跨设备的权重分布、内存布局和通信模式。该特性与标准的密集 Transformer "
|
||||||
|
"架构兼容,并能无缝集成到 vLLM 的服务流水线中。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:11
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:11
|
||||||
msgid "Benefits of Finegrained TP"
|
msgid "Benefits of Fine-grained TP"
|
||||||
msgstr "细粒度 TP 的优势"
|
msgstr "细粒度 TP 的优势"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:13
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:13
|
||||||
@@ -62,11 +66,12 @@ msgstr "细粒度张量并行通过有针对性的权重分片带来两个主要
|
|||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:15
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:15
|
||||||
msgid ""
|
msgid ""
|
||||||
"**Reduced Per-Device Memory Footprint**: Fine-grained TP shards large "
|
"**Reduced Per-Device Memory Footprint**: Fine-grained TP shards large "
|
||||||
"weight matrices(e.g., LM Head, o_proj)across devices, lowering peak "
|
"weight matrices (e.g., LM Head, o_proj) across devices, lowering peak "
|
||||||
"memory usage and enabling larger batches or deployment on memory-limited "
|
"memory usage and enabling larger batches or deployment on memory-limited "
|
||||||
"hardware—without quantization."
|
"hardware—without quantization."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**降低单设备内存占用**: 细粒度 TP 将大型权重矩阵(例如 LM Head、o_proj)分片到多个设备上,降低了峰值内存使用量,从而支持更大的批次或在内存受限的硬件上进行部署——无需量化。"
|
"**降低单设备内存占用**: 细粒度 TP 将大型权重矩阵(例如 LM "
|
||||||
|
"Head、o_proj)分片到多个设备上,降低了峰值内存使用量,从而支持更大的批次或在内存受限的硬件上进行部署——无需量化。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:18
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:18
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -76,7 +81,9 @@ msgid ""
|
|||||||
"efficiency—especially for latency-sensitive layers like LM Head and "
|
"efficiency—especially for latency-sensitive layers like LM Head and "
|
||||||
"o_proj."
|
"o_proj."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**加速 GEMM 中的内存访问**: 在解码密集型工作负载中,GEMM 性能通常受内存带宽限制。权重分片减少了每个设备需要获取的权重数据量,从而降低了 DRAM 流量并提高了带宽效率——对于 LM Head 和 o_proj 等延迟敏感层尤其如此。"
|
"**加速 GEMM 中的内存访问**: 在解码密集型工作负载中,GEMM "
|
||||||
|
"性能通常受内存带宽限制。权重分片减少了每个设备需要获取的权重数据量,从而降低了 DRAM 流量并提高了带宽效率——对于 LM Head 和 "
|
||||||
|
"o_proj 等延迟敏感层尤其如此。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:21
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:21
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -99,7 +106,9 @@ msgid ""
|
|||||||
"Fine-grained TP is **model-agnostic** and supports all standard dense "
|
"Fine-grained TP is **model-agnostic** and supports all standard dense "
|
||||||
"transformer architectures, including Llama, Qwen, DeepSeek (base/dense "
|
"transformer architectures, including Llama, Qwen, DeepSeek (base/dense "
|
||||||
"variants), and others."
|
"variants), and others."
|
||||||
msgstr "细粒度 TP 是**模型无关的**,支持所有标准的密集 Transformer 架构,包括 Llama、Qwen、DeepSeek(基础/密集变体)等。"
|
msgstr ""
|
||||||
|
"细粒度 TP 是**模型无关的**,支持所有标准的密集 Transformer 架构,包括 "
|
||||||
|
"Llama、Qwen、DeepSeek(基础/密集变体)等。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:31
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:31
|
||||||
msgid "Component & Execution Mode Support"
|
msgid "Component & Execution Mode Support"
|
||||||
@@ -161,7 +170,9 @@ msgstr "⚠️ 注意:"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"`o_proj` TP is only supported in Graph mode during Decode, because "
|
"`o_proj` TP is only supported in Graph mode during Decode, because "
|
||||||
"dummy_run in eager mode will not trigger o_proj."
|
"dummy_run in eager mode will not trigger o_proj."
|
||||||
msgstr "`o_proj` TP 仅在 Decode 阶段的 Graph 模式下受支持,因为 eager 模式下的 dummy_run 不会触发 o_proj。"
|
msgstr ""
|
||||||
|
"`o_proj` TP 仅在 Decode 阶段的 Graph 模式下受支持,因为 eager 模式下的 dummy_run 不会触发 "
|
||||||
|
"o_proj。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:43
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:43
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -194,7 +205,7 @@ msgid ""
|
|||||||
msgstr "⚠️ 违反这些约束将导致运行时错误或未定义行为。"
|
msgstr "⚠️ 违反这些约束将导致运行时错误或未定义行为。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:56
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:56
|
||||||
msgid "How to Use Finegrained TP"
|
msgid "How to Use Fine-grained TP"
|
||||||
msgstr "如何使用细粒度 TP"
|
msgstr "如何使用细粒度 TP"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:58
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md:58
|
||||||
@@ -222,7 +233,9 @@ msgid ""
|
|||||||
"decode instances in an environment of 32 cards Ascend 910B*64G (A2), with"
|
"decode instances in an environment of 32 cards Ascend 910B*64G (A2), with"
|
||||||
" parallel configuration as DP32+EP32, and fine-grained TP size of 8; the "
|
" parallel configuration as DP32+EP32, and fine-grained TP size of 8; the "
|
||||||
"performance data is as follows."
|
"performance data is as follows."
|
||||||
msgstr "为评估细粒度 TP 在大规模服务场景中的有效性,我们使用模型 **DeepSeek-R1-W8A8**,在 32 卡 Ascend 910B*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"
|
msgstr ""
|
||||||
|
"为评估细粒度 TP 在大规模服务场景中的有效性,我们使用模型 **DeepSeek-R1-W8A8**,在 32 卡 Ascend "
|
||||||
|
"910B*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
|
#: ../../source/user_guide/feature_guide/Fine_grained_TP.md
|
||||||
msgid "Module"
|
msgid "Module"
|
||||||
@@ -304,4 +317,6 @@ msgid ""
|
|||||||
"PD separation, where models are typically deployed in all-DP mode. In "
|
"PD separation, where models are typically deployed in all-DP mode. In "
|
||||||
"this setup, sharding weight-heavy layers reduces redundant storage and "
|
"this setup, sharding weight-heavy layers reduces redundant storage and "
|
||||||
"memory pressure."
|
"memory pressure."
|
||||||
msgstr "细粒度 TP 在 PD 分离的**解码实例**中**最有效**,因为模型通常以全 DP 模式部署。在此设置中,对权重密集的层进行分片可以减少冗余存储和内存压力。"
|
msgstr ""
|
||||||
|
"细粒度 TP 在 PD 分离的**解码实例**中**最有效**,因为模型通常以全 DP "
|
||||||
|
"模式部署。在此设置中,对权重密集的层进行分片可以减少冗余存储和内存压力。"
|
||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -34,7 +34,8 @@ msgid ""
|
|||||||
"Deploying these two stages in independent vLLM instances brings three "
|
"Deploying these two stages in independent vLLM instances brings three "
|
||||||
"practical benefits:"
|
"practical benefits:"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"**解耦编码器** 将多模态大语言模型的视觉编码器阶段运行在与预填充/解码器阶段分离的进程中。将这两个阶段部署在独立的 vLLM 实例中,带来三个实际好处:"
|
"**解耦编码器** 将多模态大语言模型的视觉编码器阶段运行在与预填充/解码器阶段分离的进程中。将这两个阶段部署在独立的 vLLM "
|
||||||
|
"实例中,带来三个实际好处:"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:7
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:7
|
||||||
msgid "**Independent, fine-grained scaling**"
|
msgid "**Independent, fine-grained scaling**"
|
||||||
@@ -89,8 +90,8 @@ msgid ""
|
|||||||
"Design doc: <https://docs.google.com/document/d"
|
"Design doc: <https://docs.google.com/document/d"
|
||||||
"/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>"
|
"/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"设计文档:<https://docs.google.com/document/d"
|
"设计文档:<https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-"
|
||||||
"/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>"
|
"CpnuLLzmR8l9BAE>"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:27
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:27
|
||||||
msgid "Usage"
|
msgid "Usage"
|
||||||
@@ -107,16 +108,16 @@ msgid ""
|
|||||||
"1 Encoder instance + 1 PD instance: "
|
"1 Encoder instance + 1 PD instance: "
|
||||||
"`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
|
"`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"1 个编码器实例 + 1 个 PD 实例:"
|
"1 个编码器实例 + 1 个 PD "
|
||||||
"`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
|
"实例:`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:35
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:35
|
||||||
msgid ""
|
msgid ""
|
||||||
"1 Encoder instance + 1 Prefill instance + 1 Decode instance: "
|
"1 Encoder instance + 1 Prefill instance + 1 Decode instance: "
|
||||||
"`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
|
"`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"1 个编码器实例 + 1 个预填充实例 + 1 个解码实例:"
|
"1 个编码器实例 + 1 个预填充实例 + 1 "
|
||||||
"`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
|
"个解码实例:`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:40
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:40
|
||||||
msgid "Development"
|
msgid "Development"
|
||||||
@@ -154,7 +155,8 @@ msgid ""
|
|||||||
"instance to the PD instance. All related code is under "
|
"instance to the PD instance. All related code is under "
|
||||||
"`vllm/distributed/ec_transfer`."
|
"`vllm/distributed/ec_transfer`."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"一个连接器将编码器缓存 (EC) 嵌入向量从编码器实例传输到 PD 实例。所有相关代码位于 `vllm/distributed/ec_transfer` 目录下。"
|
"一个连接器将编码器缓存 (EC) 嵌入向量从编码器实例传输到 PD 实例。所有相关代码位于 "
|
||||||
|
"`vllm/distributed/ec_transfer` 目录下。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:53
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:53
|
||||||
msgid "Key abstractions"
|
msgid "Key abstractions"
|
||||||
@@ -175,7 +177,7 @@ msgid "*Worker role* – loads the embeddings into memory."
|
|||||||
msgstr "*工作进程角色* – 将嵌入向量加载到内存中。"
|
msgstr "*工作进程角色* – 将嵌入向量加载到内存中。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:59
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:59
|
||||||
msgid "**EPD Load Balance Proxy** -"
|
msgid "**EPD Load Balancing Proxy** -"
|
||||||
msgstr "**EPD 负载均衡代理** -"
|
msgstr "**EPD 负载均衡代理** -"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:60
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:60
|
||||||
@@ -200,12 +202,14 @@ msgid ""
|
|||||||
" to facilitate the kv transfer between P and D. For step-by-step "
|
" to facilitate the kv transfer between P and D. For step-by-step "
|
||||||
"deployment and configuration of Mooncake, refer to the following guide:"
|
"deployment and configuration of Mooncake, refer to the following guide:"
|
||||||
" "
|
" "
|
||||||
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
|
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"我们使用来自 `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` 的 **MooncakeLayerwiseConnector** 创建示例设置,并参考 "
|
"我们使用来自 "
|
||||||
"`examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` 来促进 P 和 D 之间的 KV 传输。关于 Mooncake 的逐步部署和配置,请参考以下指南:"
|
"`vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py`"
|
||||||
" "
|
" 的 **MooncakeLayerwiseConnector** 创建示例设置,并参考 "
|
||||||
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
|
"`examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py`"
|
||||||
|
" 来促进 P 和 D 之间的 KV 传输。关于 Mooncake 的逐步部署和配置,请参考以下指南: "
|
||||||
|
"[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:66
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:66
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -218,7 +222,10 @@ msgid ""
|
|||||||
"`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` "
|
"`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` "
|
||||||
"shows the brief idea about the disaggregated prefill."
|
"shows the brief idea about the disaggregated prefill."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"对于 PD 解耦部分,当使用 MooncakeLayerwiseConnector 时:请求首先进入解码器实例,解码器通过元服务器反向触发一个远程预填充任务。然后预填充节点执行推理,并将 KV 缓存逐层推送到解码器,实现计算与传输的重叠。一旦传输完成,解码器无缝地继续后续的令牌生成。`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` 展示了关于解耦预填充的简要思路。"
|
"对于 PD 解耦部分,当使用 MooncakeLayerwiseConnector "
|
||||||
|
"时:请求首先进入解码器实例,解码器通过元服务器反向触发一个远程预填充任务。然后预填充节点执行推理,并将 KV "
|
||||||
|
"缓存逐层推送到解码器,实现计算与传输的重叠。一旦传输完成,解码器无缝地继续后续的令牌生成。`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md`"
|
||||||
|
" 展示了关于解耦预填充的简要思路。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:69
|
#: ../../source/user_guide/feature_guide/epd_disaggregation.md:69
|
||||||
msgid "Limitations"
|
msgid "Limitations"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -35,10 +35,12 @@ msgid ""
|
|||||||
"vLLM deployment, with its own endpoint, and have an external router "
|
"vLLM deployment, with its own endpoint, and have an external router "
|
||||||
"balance HTTP requests between them, making use of appropriate real-time "
|
"balance HTTP requests between them, making use of appropriate real-time "
|
||||||
"telemetry from each server for routing decisions."
|
"telemetry from each server for routing decisions."
|
||||||
msgstr "在这种情况下,将每个数据并行等级视为一个独立的 vLLM 部署(拥有自己的端点),并使用一个外部路由器在它们之间平衡 HTTP 请求,同时利用来自每个服务器的适当实时遥测数据来做出路由决策,会更加方便。"
|
msgstr ""
|
||||||
|
"在这种情况下,将每个数据并行等级视为一个独立的 vLLM 部署(拥有自己的端点),并使用一个外部路由器在它们之间平衡 HTTP "
|
||||||
|
"请求,同时利用来自每个服务器的适当实时遥测数据来做出路由决策,会更加方便。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:7
|
#: ../../source/user_guide/feature_guide/external_dp.md:7
|
||||||
msgid "Getting Start"
|
msgid "Getting Started"
|
||||||
msgstr "开始使用"
|
msgstr "开始使用"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:9
|
#: ../../source/user_guide/feature_guide/external_dp.md:9
|
||||||
@@ -47,7 +49,9 @@ msgid ""
|
|||||||
"DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external"
|
"DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external"
|
||||||
"#external-load-balancing) is already natively supported by vLLM. In vllm-"
|
"#external-load-balancing) is already natively supported by vLLM. In vllm-"
|
||||||
"ascend we provide two enhanced functionalities:"
|
"ascend we provide two enhanced functionalities:"
|
||||||
msgstr "[外部数据并行](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) 功能已由 vLLM 原生支持。在 vllm-ascend 中,我们提供了两项增强功能:"
|
msgstr ""
|
||||||
|
"[外部数据并行](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external"
|
||||||
|
"#external-load-balancing) 功能已由 vLLM 原生支持。在 vllm-ascend 中,我们提供了两项增强功能:"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:11
|
#: ../../source/user_guide/feature_guide/external_dp.md:11
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -85,7 +89,9 @@ msgid ""
|
|||||||
"parallel. These can be mock servers or actual vLLM servers. Note that "
|
"parallel. These can be mock servers or actual vLLM servers. Note that "
|
||||||
"this proxy also works with only one vLLM server running, but will fall "
|
"this proxy also works with only one vLLM server running, but will fall "
|
||||||
"back to direct request forwarding which is meaningless."
|
"back to direct request forwarding which is meaningless."
|
||||||
msgstr "首先,您需要至少运行两个处于数据并行模式的 vLLM 服务器。这些可以是模拟服务器或实际的 vLLM 服务器。请注意,此代理在仅运行一个 vLLM 服务器时也能工作,但会退化为直接请求转发,这没有意义。"
|
msgstr ""
|
||||||
|
"首先,您需要至少运行两个处于数据并行模式的 vLLM 服务器。这些可以是模拟服务器或实际的 vLLM 服务器。请注意,此代理在仅运行一个 vLLM"
|
||||||
|
" 服务器时也能工作,但会退化为直接请求转发,这没有意义。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:29
|
#: ../../source/user_guide/feature_guide/external_dp.md:29
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -93,7 +99,9 @@ msgid ""
|
|||||||
"launch script in `examples/external_online_dp`. For scenarios of large DP"
|
"launch script in `examples/external_online_dp`. For scenarios of large DP"
|
||||||
" size across multiple nodes, we recommend using our launch script for "
|
" size across multiple nodes, we recommend using our launch script for "
|
||||||
"convenience."
|
"convenience."
|
||||||
msgstr "您可以手动逐个启动外部 vLLM 数据并行服务器,也可以使用 `examples/external_online_dp` 中的启动脚本。对于跨多个节点的大规模数据并行场景,我们建议使用我们的启动脚本以方便操作。"
|
msgstr ""
|
||||||
|
"您可以手动逐个启动外部 vLLM 数据并行服务器,也可以使用 `examples/external_online_dp` "
|
||||||
|
"中的启动脚本。对于跨多个节点的大规模数据并行场景,我们建议使用我们的启动脚本以方便操作。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:31
|
#: ../../source/user_guide/feature_guide/external_dp.md:31
|
||||||
msgid "Manually Launch"
|
msgid "Manually Launch"
|
||||||
@@ -112,7 +120,12 @@ msgid ""
|
|||||||
" instances in one command on each node. It will internally call "
|
" instances in one command on each node. It will internally call "
|
||||||
"`examples/external_online_dp/run_dp_template.sh` for each DP rank with "
|
"`examples/external_online_dp/run_dp_template.sh` for each DP rank with "
|
||||||
"proper DP-related parameters."
|
"proper DP-related parameters."
|
||||||
msgstr "首先,您需要根据您的 vLLM 配置修改 `examples/external_online_dp/run_dp_template.sh`。然后,您可以使用 `examples/external_online_dp/launch_online_dp.py` 在每个节点上通过一条命令启动多个 vLLM 实例。它将在内部为每个数据并行等级调用 `examples/external_online_dp/run_dp_template.sh`,并传入适当的数据并行相关参数。"
|
msgstr ""
|
||||||
|
"首先,您需要根据您的 vLLM 配置修改 "
|
||||||
|
"`examples/external_online_dp/run_dp_template.sh`。然后,您可以使用 "
|
||||||
|
"`examples/external_online_dp/launch_online_dp.py` 在每个节点上通过一条命令启动多个 vLLM "
|
||||||
|
"实例。它将在内部为每个数据并行等级调用 "
|
||||||
|
"`examples/external_online_dp/run_dp_template.sh`,并传入适当的数据并行相关参数。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:43
|
#: ../../source/user_guide/feature_guide/external_dp.md:43
|
||||||
msgid "An example of running external DP in one single node:"
|
msgid "An example of running external DP in one single node:"
|
||||||
@@ -131,7 +144,9 @@ msgid ""
|
|||||||
"After all vLLM DP instances are launched, you can now launch the load-"
|
"After all vLLM DP instances are launched, you can now launch the load-"
|
||||||
"balance proxy server, which serves as an entrypoint for coming requests "
|
"balance proxy server, which serves as an entrypoint for coming requests "
|
||||||
"and load-balances them between vLLM DP instances."
|
"and load-balances them between vLLM DP instances."
|
||||||
msgstr "所有 vLLM 数据并行实例启动后,您现在可以启动负载均衡代理服务器。该服务器作为传入请求的入口点,并在各个 vLLM 数据并行实例之间进行负载均衡。"
|
msgstr ""
|
||||||
|
"所有 vLLM 数据并行实例启动后,您现在可以启动负载均衡代理服务器。该服务器作为传入请求的入口点,并在各个 vLLM "
|
||||||
|
"数据并行实例之间进行负载均衡。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/external_dp.md:70
|
#: ../../source/user_guide/feature_guide/external_dp.md:70
|
||||||
msgid "The proxy server has the following features:"
|
msgid "The proxy server has the following features:"
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ msgid ""
|
|||||||
msgstr ""
|
msgstr ""
|
||||||
"Project-Id-Version: vllm-ascend \n"
|
"Project-Id-Version: vllm-ascend \n"
|
||||||
"Report-Msgid-Bugs-To: \n"
|
"Report-Msgid-Bugs-To: \n"
|
||||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||||
"Language: zh_CN\n"
|
"Language: zh_CN\n"
|
||||||
@@ -24,7 +24,7 @@ msgid "Distributed DP Server With Large-Scale Expert Parallelism"
|
|||||||
msgstr "分布式数据并行服务器与大规模专家并行"
|
msgstr "分布式数据并行服务器与大规模专家并行"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:3
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:3
|
||||||
msgid "Getting Start"
|
msgid "Getting Started"
|
||||||
msgstr "快速开始"
|
msgstr "快速开始"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:5
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:5
|
||||||
@@ -42,7 +42,11 @@ msgid ""
|
|||||||
"independently, while the decoder nodes use the 192.0.0.5 node as the "
|
"independently, while the decoder nodes use the 192.0.0.5 node as the "
|
||||||
"master node."
|
"master node."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"vLLM-Ascend 现已支持在大规模**专家并行(EP)**场景下的预填充-解码(PD)解耦。为获得更好的性能,vLLM-Ascend 中应用了分布式数据并行服务器。在 PD 分离场景下,可以根据 PD 节点的不同特性实施不同的优化策略,从而实现更灵活的模型部署。以 DeepSeek 模型为例,使用 8 台 Atlas 800T A3 服务器部署模型。假设服务器 IP 从 192.0.0.1 开始到 192.0.0.8 结束。使用前 4 台服务器作为预填充节点,后 4 台服务器作为解码节点。并且预填充节点独立部署为主节点,而解码节点使用 192.0.0.5 节点作为主节点。"
|
"vLLM-Ascend 现已支持在大规模**专家并行(EP)**场景下的预填充-解码(PD)解耦。为获得更好的性能,vLLM-Ascend "
|
||||||
|
"中应用了分布式数据并行服务器。在 PD 分离场景下,可以根据 PD 节点的不同特性实施不同的优化策略,从而实现更灵活的模型部署。以 "
|
||||||
|
"DeepSeek 模型为例,使用 8 台 Atlas 800T A3 服务器部署模型。假设服务器 IP 从 192.0.0.1 开始到 "
|
||||||
|
"192.0.0.8 结束。使用前 4 台服务器作为预填充节点,后 4 台服务器作为解码节点。并且预填充节点独立部署为主节点,而解码节点使用 "
|
||||||
|
"192.0.0.5 节点作为主节点。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:8
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:8
|
||||||
msgid "Verify Multi-Node Communication Environment"
|
msgid "Verify Multi-Node Communication Environment"
|
||||||
@@ -65,7 +69,8 @@ msgid ""
|
|||||||
"the Atlas A3 generation, both intra-node and inter-node connectivity are "
|
"the Atlas A3 generation, both intra-node and inter-node connectivity are "
|
||||||
"via HCCS."
|
"via HCCS."
|
||||||
msgstr ""
|
msgstr ""
|
||||||
"所有 NPU 必须互连。对于 Atlas A2 代,节点内连接通过 HCCS,节点间连接通过 RDMA。对于 Atlas A3 代,节点内和节点间连接均通过 HCCS。"
|
"所有 NPU 必须互连。对于 Atlas A2 代,节点内连接通过 HCCS,节点间连接通过 RDMA。对于 Atlas A3 "
|
||||||
|
"代,节点内和节点间连接均通过 HCCS。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:15
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:15
|
||||||
msgid "Verification Process"
|
msgid "Verification Process"
|
||||||
@@ -145,7 +150,9 @@ msgid ""
|
|||||||
"master node independently, while the decoder nodes use the 192.0.0.5 node"
|
"master node independently, while the decoder nodes use the 192.0.0.5 node"
|
||||||
" as the master node. This leads to differences in 'dp_size_local' and "
|
" as the master node. This leads to differences in 'dp_size_local' and "
|
||||||
"'dp_rank_start'"
|
"'dp_rank_start'"
|
||||||
msgstr "请注意,预填充节点和解码节点可能具有不同的配置。在此示例中,每个预填充节点独立部署为主节点,而解码节点使用 192.0.0.5 节点作为主节点。这导致了 'dp_size_local' 和 'dp_rank_start' 的差异。"
|
msgstr ""
|
||||||
|
"请注意,预填充节点和解码节点可能具有不同的配置。在此示例中,每个预填充节点独立部署为主节点,而解码节点使用 192.0.0.5 "
|
||||||
|
"节点作为主节点。这导致了 'dp_size_local' 和 'dp_rank_start' 的差异。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:319
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:319
|
||||||
msgid "Example proxy for Distributed DP Server"
|
msgid "Example proxy for Distributed DP Server"
|
||||||
@@ -251,7 +258,10 @@ msgid ""
|
|||||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
||||||
"project/vllm-"
|
"project/vllm-"
|
||||||
"ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
|
"ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
msgstr "您可以在仓库的示例中找到代理程序,[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
|
msgstr ""
|
||||||
|
"您可以在仓库的示例中找到代理程序,[load_balance_proxy_server_example.py](https://github.com"
|
||||||
|
"/vllm-project/vllm-"
|
||||||
|
"ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:366
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:366
|
||||||
msgid "Benchmark"
|
msgid "Benchmark"
|
||||||
@@ -262,7 +272,9 @@ msgid ""
|
|||||||
"We recommend using aisbench tool to assess performance. "
|
"We recommend using aisbench tool to assess performance. "
|
||||||
"[aisbench](https://gitee.com/aisbench/benchmark). Execute the following "
|
"[aisbench](https://gitee.com/aisbench/benchmark). Execute the following "
|
||||||
"commands to install aisbench"
|
"commands to install aisbench"
|
||||||
msgstr "我们推荐使用 aisbench 工具评估性能。[aisbench](https://gitee.com/aisbench/benchmark)。执行以下命令安装 aisbench"
|
msgstr ""
|
||||||
|
"我们推荐使用 aisbench "
|
||||||
|
"工具评估性能。[aisbench](https://gitee.com/aisbench/benchmark)。执行以下命令安装 aisbench"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:376
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:376
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -281,7 +293,9 @@ msgid ""
|
|||||||
"You can change the configuration in the directory "
|
"You can change the configuration in the directory "
|
||||||
":`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take "
|
":`benchmark/ais_bench/benchmark/configs/models/vllm_api` Take "
|
||||||
"`vllm_api_stream_chat.py` as an example:"
|
"`vllm_api_stream_chat.py` as an example:"
|
||||||
msgstr "您可以在目录:`benchmark/ais_bench/benchmark/configs/models/vllm_api` 中更改配置。以 `vllm_api_stream_chat.py` 为例:"
|
msgstr ""
|
||||||
|
"您可以在目录:`benchmark/ais_bench/benchmark/configs/models/vllm_api` 中更改配置。以 "
|
||||||
|
"`vllm_api_stream_chat.py` 为例:"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:411
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:411
|
||||||
msgid ""
|
msgid ""
|
||||||
@@ -293,7 +307,9 @@ msgstr "以 gsm8k 数据集为例,执行以下命令评估性能。"
|
|||||||
msgid ""
|
msgid ""
|
||||||
"For more details on commands and parameters for aisbench, refer to "
|
"For more details on commands and parameters for aisbench, refer to "
|
||||||
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
||||||
msgstr "有关 aisbench 命令和参数的更多详细信息,请参考 [aisbench](https://gitee.com/aisbench/benchmark)"
|
msgstr ""
|
||||||
|
"有关 aisbench 命令和参数的更多详细信息,请参考 "
|
||||||
|
"[aisbench](https://gitee.com/aisbench/benchmark)"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:419
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:419
|
||||||
msgid "Prefill & Decode Configuration Details"
|
msgid "Prefill & Decode Configuration Details"
|
||||||
@@ -368,7 +384,9 @@ msgid ""
|
|||||||
"is 7K. In this scenario, we give a recommended configuration for "
|
"is 7K. In this scenario, we give a recommended configuration for "
|
||||||
"distributed DP server with high EP. Here we use 4 nodes for prefill and 4"
|
"distributed DP server with high EP. Here we use 4 nodes for prefill and 4"
|
||||||
" nodes for decode."
|
" nodes for decode."
|
||||||
msgstr "例如,如果平均输入长度为 3.5k,输出长度为 1.1k,上下文长度为 16k,输入数据集的最大长度为 7K。在此场景下,我们为具有高 EP 的分布式数据并行服务器提供了一个推荐配置。这里我们使用 4 个节点进行预填充,4 个节点进行解码。"
|
msgstr ""
|
||||||
|
"例如,如果平均输入长度为 3.5k,输出长度为 1.1k,上下文长度为 16k,输入数据集的最大长度为 7K。在此场景下,我们为具有高 EP "
|
||||||
|
"的分布式数据并行服务器提供了一个推荐配置。这里我们使用 4 个节点进行预填充,4 个节点进行解码。"
|
||||||
|
|
||||||
#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
|
#: ../../source/user_guide/feature_guide/large_scale_ep.md:282
|
||||||
msgid "node"
|
msgid "node"
|
||||||
|
|||||||
File diff suppressed because it is too large
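Many of the hunks above leave the `msgid` untouched and only re-wrap a long `msgstr` across several quoted lines. In the PO format, adjacent quoted lines of one entry are concatenated, so such a hunk does not change the translation. A minimal sketch of how a reviewer might check this, assuming each entry is passed in as a list of raw lines (the helper name `po_value` is hypothetical and not part of this repository or of Babel):

```python
def po_value(lines):
    """Join the quoted lines of a single msgid/msgstr entry into one raw string.

    Escape sequences are left untouched; this is enough to compare two
    wrappings of the same entry, because an escape never spans two lines.
    """
    parts = []
    for line in lines:
        line = line.strip()
        # The first line carries the keyword, e.g. 'msgstr "..."'.
        if line.startswith(("msgid ", "msgstr ")):
            line = line.split(" ", 1)[1]
        if not line:
            continue
        if not (len(line) >= 2 and line[0] == '"' and line[-1] == '"'):
            raise ValueError(f"not a quoted PO line: {line!r}")
        parts.append(line[1:-1])
    return "".join(parts)


# Old and new sides of one re-wrapped hunk from large_scale_ep.po.
old_entry = [
    'msgstr "有关 aisbench 命令和参数的更多详细信息,请参考 [aisbench](https://gitee.com/aisbench/benchmark)"',
]
new_entry = [
    'msgstr ""',
    '"有关 aisbench 命令和参数的更多详细信息,请参考 "',
    '"[aisbench](https://gitee.com/aisbench/benchmark)"',
]

# Old and new sides of a hunk whose text really changed (Fine_grained_TP.po).
old_title = ['msgid "Benefits of Finegrained TP"']
new_title = ['msgid "Benefits of Fine-grained TP"']

print(po_value(old_entry) == po_value(new_entry))
print(po_value(old_title) == po_value(new_title))
```

The first comparison covers a wrap-only hunk and the second a hunk with a real wording change, so the helper separates the two kinds of change seen in this diff.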