[v0.18.0][Doc] Translated Doc files 2026-04-22 (#8565)
## Auto-Translation Summary

Translated **43** file(s):

- `docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/faqs.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/installation.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po`
- `docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po`

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24767290887)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
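The diff below shows only the regenerated `.po` source catalogs. For context, Sphinx does not read `.po` files directly; it loads compiled binary `.mo` files from the same `LC_MESSAGES` tree at build time. A minimal sketch of inspecting and compiling one of the updated catalogs, assuming the third-party `polib` package (which is not part of this repository's tooling):

```python
# Hypothetical helper, not part of the vllm-ascend workflow: inspect one of the
# updated catalogs and compile it to the .mo file Sphinx loads at build time.
import polib

po_path = "docs/source/locale/zh_CN/LC_MESSAGES/faqs.po"
po = polib.pofile(po_path)  # parse the gettext catalog listed above

print(f"{len(po)} entries, {po.percent_translated()}% translated")
po.save_as_mofile(po_path[:-2] + "mo")  # writes faqs.mo next to faqs.po
```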
@@ -1,14 +1,7 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2025, vllm-ascend team
# This file is distributed under the same license as the vllm-ascend
# package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
#
msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -187,8 +180,8 @@ msgstr "`--tensor-parallel-size` 16 是张量并行(TP)大小的常见设置

#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:305
msgid ""
"`--prefill-context-parallel-size` 2 are common settings for prefill "
"context parallelism (PCP) sizes."
"`--prefill-context-parallel-size` 2 is common setting for prefill context"
" parallelism (PCP) sizes."
msgstr "`--prefill-context-parallel-size` 2 是预填充上下文并行(PCP)大小的常见设置。"

#: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:306

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -40,7 +40,9 @@ msgid ""
"demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two "
"Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies "
"4 NPU cards and uses PD-colocated deployment."
msgstr "本指南以 Qwen2.5-72B-Instruct 模型为例,演示如何在两个 Atlas 800T A2 节点上使用 vllm-ascend v0.11.0(包含 vLLM v0.11.0)部署两个 vLLM 实例。每个实例占用 4 个 NPU 卡,并采用 PD 共置部署。"
msgstr ""
"本指南以 Qwen2.5-72B-Instruct 模型为例,演示如何在两个 Atlas 800T A2 节点上使用 vllm-ascend "
"v0.11.0(包含 vLLM v0.11.0)部署两个 vLLM 实例。每个实例占用 4 个 NPU 卡,并采用 PD 共置部署。"

#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:14
msgid "Verify Multi-Node Communication Environment"
@@ -128,7 +130,10 @@ msgid ""
"Mooncake is the serving platform for Kimi, a leading LLM service provided"
" by Moonshot AI. Installation and compilation guide: <https://github.com"
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>."
msgstr "Mooncake 是 Kimi 的服务平台,Kimi 是由 Moonshot AI 提供的领先 LLM 服务。安装和编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>。"
msgstr ""
"Mooncake 是 Kimi 的服务平台,Kimi 是由 Moonshot AI 提供的领先 LLM "
"服务。安装和编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file"
"#build-and-use-binaries>。"

#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:121
msgid "First, obtain the Mooncake project using the following command:"
@@ -275,7 +280,10 @@ msgid ""
" cross-node, cross-instance KV Cache. Instance 1 utilizes NPU cards [0-3]"
" on the first Atlas 800T A2 server, while Instance 2 utilizes cards [0-3]"
" on the second server."
msgstr "在节点 1 和节点 2 上分别创建容器,并在每个容器中启动 Qwen2.5-72B-Instruct 模型服务,以测试跨节点、跨实例 KV Cache 的可重用性和性能。实例 1 使用第一个 Atlas 800T A2 服务器上的 NPU 卡 [0-3],而实例 2 使用第二个服务器上的卡 [0-3]。"
msgstr ""
"在节点 1 和节点 2 上分别创建容器,并在每个容器中启动 Qwen2.5-72B-Instruct 模型服务,以测试跨节点、跨实例 KV "
"Cache 的可重用性和性能。实例 1 使用第一个 Atlas 800T A2 服务器上的 NPU 卡 [0-3],而实例 2 "
"使用第二个服务器上的卡 [0-3]。"

#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:208
msgid "Deploy Instance 1"
@@ -430,9 +438,9 @@ msgstr "步骤 2 的准备工作"
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:285
msgid ""
"Before Step 2, send a fully random Dataset B to Instance 1. Due to the "
"unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,"
" Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's"
" cache only in Node 1's DRAM."
"unified on-chip memory/DRAM KV Cache with LRU (Least Recently Used) "
"eviction policy, Dataset B's cache evicts Dataset A's cache from on-chip "
"memory, leaving Dataset A's cache only in Node 1's DRAM."
msgstr "在步骤2之前,向实例1发送一个完全随机的数据集B。由于采用了具有LRU(最近最少使用)淘汰策略的统一HBM/DRAM KV缓存,数据集B的缓存会将数据集A的缓存从HBM中淘汰,使得数据集A的缓存仅保留在节点1的DRAM中。"

#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:290

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -40,7 +40,7 @@ msgid ""
"servers to deploy the \"2P1D\" architecture. Assume the IP of the "
"prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and "
"the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). "
"On each server, use 8 NPUs 16 chips to deploy one service instance."
"On each server, use 8 NPUs and 16 chips to deploy one service instance."
msgstr ""
"以 Deepseek-r1-w8a8 模型为例,使用 4 台 Atlas 800T A3 服务器部署 \"2P1D\" 架构。假设预填充服务器 "
"IP 为 192.0.0.1(预填充节点 1)和 192.0.0.2(预填充节点 2),解码服务器 IP 为 192.0.0.3(解码节点 1)和"

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -30,16 +30,17 @@ msgstr "开始使用"
#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:5
msgid ""
"vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide "
"takes one-by-one steps to verify these features with constrained "
"resources."
msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南将逐步引导您在有限资源下验证这些功能。"
"provides step-by-step instructions to verify this features in resource-"
"constrained environments."
msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南提供逐步说明,帮助您在资源受限的环境中验证这些功能。"

#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:7
msgid ""
"Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend "
"Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend "
"v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "
"\"1P1D\" architecture. Assume the IP address is 192.0.0.1."
msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vLLM-Ascend v0.11.0rc1 (包含 vLLM v0.11.0) 部署 \"1P1D\" 架构。假设 IP 地址为 192.0.0.1。"
"\"1P1D\" architecture (one Prefiller and one Decoder on the same node). "
"Assume the IP address is 192.0.0.1."
msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vllm-ascend v0.11.0rc1(包含 vLLM v0.11.0)部署 \"1P1D\" 架构(同一节点上一个预填充器和一个解码器)。假设 IP 地址为 192.0.0.1。"

#: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:9
msgid "Verify Communication Environment"

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -32,32 +32,25 @@ msgid ""
"DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-"
"thinking mode. Compared to the previous version, this upgrade brings "
"improvements in multiple aspects:"
msgstr ""
"DeepSeek-V3.1 是一个支持思考模式和非思考模式的混合模型。与前一版本相比,此"
"次升级在多个方面带来了改进:"
msgstr "DeepSeek-V3.1 是一个支持思考模式和非思考模式的混合模型。与前一版本相比,此次升级在多个方面带来了改进:"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:7
msgid ""
"Hybrid thinking mode: One model supports both thinking mode and non-"
"thinking mode by changing the chat template."
msgstr ""
"混合思考模式:一个模型通过更改聊天模板,同时支持思考模式和非思考模式。"
msgstr "混合思考模式:一个模型通过更改聊天模板,同时支持思考模式和非思考模式。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:9
msgid ""
"Smarter tool calling: Through post-training optimization, the model's "
"performance in tool usage and agent tasks has significantly improved."
msgstr ""
"更智能的工具调用:通过后训练优化,模型在工具使用和智能体任务方面的性能显著提"
"升。"
msgstr "更智能的工具调用:通过后训练优化,模型在工具使用和智能体任务方面的性能显著提升。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:11
msgid ""
"Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable "
"answer quality to DeepSeek-R1-0528, while responding more quickly."
msgstr ""
"更高的思考效率:DeepSeek-V3.1-Think 实现了与 DeepSeek-R1-0528 相当的答案质"
"量,同时响应速度更快。"
msgstr "更高的思考效率:DeepSeek-V3.1-Think 实现了与 DeepSeek-R1-0528 相当的答案质量,同时响应速度更快。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:13
msgid "The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`."
@@ -69,9 +62,7 @@ msgid ""
"including supported features, feature configuration, environment "
"preparation, single-node and multi-node deployment, accuracy and "
"performance evaluation."
msgstr ""
"本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点"
"和多节点部署、精度和性能评估。"
msgstr "本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:17
msgid "Supported Features"
@@ -90,9 +81,7 @@ msgstr ""
msgid ""
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
"the feature's configuration."
msgstr ""
"请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配"
"置。"
msgstr "请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配置。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:23
msgid "Environment Preparation"
@@ -107,8 +96,8 @@ msgid ""
"`DeepSeek-V3.1`(BF16 version): [Download model "
"weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1)."
msgstr ""
"`DeepSeek-V3.1`(BF16 版本):[下载模型权重](https://www.modelscope.cn/"
"models/deepseek-ai/DeepSeek-V3.1)。"
"`DeepSeek-V3.1`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models"
"/deepseek-ai/DeepSeek-V3.1)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:28
msgid ""
@@ -116,9 +105,9 @@ msgid ""
"[Download model weight](https://www.modelscope.cn/models/Eco-"
"Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot)."
msgstr ""
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`(混合 MTP 量化版本):[下载模型权重]"
"(https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-"
"QuaRot)。"
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`(混合 MTP "
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-"
"Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:29
msgid ""
@@ -126,9 +115,9 @@ msgid ""
" [Download model weight](https://www.modelscope.cn/models/Eco-"
"Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot)."
msgstr ""
"`DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(混合 MTP 量化版本):[下载模型权"
"重](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-"
"mtp-QuaRot)。"
"`DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(混合 MTP "
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1"
"-Terminus-w4a8-mtp-QuaRot)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:30
#, python-format
@@ -137,8 +126,7 @@ msgid ""
"[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)."
" You can use this method to quantize the model."
msgstr ""
"`量化方法`:"
"[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)。"
"`量化方法`:[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)。"
" 您可以使用此方法对模型进行量化。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:32
@@ -157,8 +145,8 @@ msgid ""
"node communication according to [verify multi-node communication "
"environment](../../installation.md#verify-multi-node-communication)."
msgstr ""
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation."
"md#verify-multi-node-communication) 验证多节点通信。"
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation.md#verify-multi-node-"
"communication) 验证多节点通信。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:38
msgid "Installation"
@@ -174,8 +162,8 @@ msgid ""
"your node, refer to [using docker](../../installation.md#set-up-using-"
"docker)."
msgstr ""
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考 [使用 docker]"
"(../../installation.md#set-up-using-docker)。"
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考 [使用 docker](../../installation.md#set-"
"up-using-docker)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:80
msgid ""
@@ -195,9 +183,7 @@ msgstr "单节点部署"
msgid ""
"Quantized model `DeepSeek-V3.1-w8a8-mtp-QuaRot` can be deployed on 1 "
"Atlas 800 A3 (64G × 16)."
msgstr ""
"量化模型 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 可以部署在 1 台 Atlas 800 A3 "
"(64G × 16)上。"
msgstr "量化模型 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 可以部署在 1 台 Atlas 800 A3 (64G × 16)上。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:88
msgid "Run the following script to execute online inference."
@@ -215,9 +201,8 @@ msgid ""
" Furthermore, enabling this feature is not recommended in scenarios where"
" PD is separated."
msgstr ""
"设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于"
"在 v1 调度器中提高输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下"
"降。此外,在 PD 分离的场景中不建议启用此功能。"
"设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于在 v1 "
"调度器中提高输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下降。此外,在 PD 分离的场景中不建议启用此功能。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:135
msgid ""
@@ -233,24 +218,20 @@ msgid ""
"`16384` is sufficient, however, for precision testing, please set it at "
"least `35000`."
msgstr ""
"`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌之和。对于输"
"入长度为 3.5K 和输出长度为 1.5K 的性能测试,`16384` 的值就足够了,但是,对于"
"精度测试,请至少将其设置为 `35000`。"
"`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌之和。对于输入长度为 3.5K 和输出长度为 1.5K "
"的性能测试,`16384` 的值就足够了,但是,对于精度测试,请至少将其设置为 `35000`。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:137
msgid ""
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
"To enable it, remove this option."
msgstr ""
"`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"
msgstr "`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:138
msgid ""
"If you use the w4a8 weight, more memory will be allocated to kvcache, and"
" you can try to increase system throughput to achieve greater throughput."
msgstr ""
"如果使用 w4a8 权重,将分配更多内存给 kvcache,您可以尝试增加系统吞吐量以实现"
"更大的吞吐量。"
msgstr "如果使用 w4a8 权重,将分配更多内存给 kvcache,您可以尝试增加系统吞吐量以实现更大的吞吐量。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:140
msgid "Multi-node Deployment"
@@ -260,8 +241,7 @@ msgstr "多节点部署"
msgid ""
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`: require at least 2 Atlas 800 A2 (64G × "
"8)."
msgstr ""
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`:需要至少 2 台 Atlas 800 A2(64G × 8)。"
msgstr "`DeepSeek-V3.1-w8a8-mtp-QuaRot`:需要至少 2 台 Atlas 800 A2(64G × 8)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:144
msgid "Run the following scripts on two nodes respectively."
@@ -284,8 +264,8 @@ msgid ""
"We recommend using Mooncake for deployment: "
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
msgstr ""
"我们建议使用 Mooncake 进行部署:[Mooncake](../features/"
"pd_disaggregation_mooncake_multi_node.md)。"
"我们建议使用 Mooncake "
"进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:256
msgid ""
@@ -293,27 +273,27 @@ msgid ""
"nodes) rather than 1P1D (2 nodes), because there is no enough NPU memory "
"to serve high concurrency in 1P1D case."
msgstr ""
"以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D"
"(2 个节点),因为在 1P1D 情况下没有足够的 NPU 内存来服务高并发。"
"以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D(2 个节点),因为在 1P1D "
"情况下没有足够的 NPU 内存来服务高并发。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:258
msgid ""
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` require 4 Atlas 800 A3 "
"(64G × 16)."
msgstr ""
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` 需要 4 台 Atlas 800 A3 "
"(64G × 16)。"
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` 需要 4 台 Atlas 800 A3 (64G ×"
" 16)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:260
msgid ""
"To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
"to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` "
"to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` "
"script on each node and deploy a `proxy.sh` script on prefill master node"
" to forward requests."
msgstr ""
"要运行 vllm-ascend `Prefill-Decode 解耦`服务,您需要在每个节点上部署一个 "
"`launch_dp_program.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill "
"主节点上部署一个 `proxy.sh` 脚本来转发请求。"
"`launch_online_dp.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill 主节点上部署一个 "
"`proxy.sh` 脚本来转发请求。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:262
msgid ""
@@ -321,9 +301,9 @@ msgid ""
"[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
msgstr ""
"`launch_online_dp.py` 用于启动外部 dp vllm 服务器。[launch\\_online\\_dp."
"py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/"
"external_online_dp/launch_online_dp.py)"
"`launch_online_dp.py` 用于启动外部 dp vllm "
"服务器。[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:265
msgid "Prefill Node 0 `run_dp_template.sh` script"
@@ -358,8 +338,8 @@ msgid ""
"Prefill-Decode (PD) separation scenario, enable MLAPO only on decode "
"nodes."
msgstr ""
"`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 "
"NPU 内存。在 Prefill-Decode (PD) 分离场景中,仅在 decode 节点上启用 MLAPO。"
"`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 NPU 内存。在 Prefill-"
"Decode (PD) 分离场景中,仅在 decode 节点上启用 MLAPO。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:576
msgid ""
@@ -367,9 +347,7 @@ msgid ""
"Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of "
"operator delivery can be implemented to overlap the operator delivery "
"latency."
msgstr ""
"`--async-scheduling`:启用异步调度功能。当启用多令牌预测 (MTP) 时,可以实现算"
"子交付的异步调度,以重叠算子交付延迟。"
msgstr "`--async-scheduling`:启用异步调度功能。当启用多令牌预测 (MTP) 时,可以实现算子交付的异步调度,以重叠算子交付延迟。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:577
msgid ""
@@ -378,9 +356,8 @@ msgid ""
"it is recommended to set them to the number of frequently occurring "
"requests on the Decode (D) node."
msgstr ""
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大"
"值为 `n = max-num-seqs`。对于其他值,建议将其设置为 Decode (D) 节点上频繁出"
"现的请求数量。"
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大值为 `n = "
"max-num-seqs`。对于其他值,建议将其设置为 Decode (D) 节点上频繁出现的请求数量。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:578
msgid ""
@@ -390,9 +367,9 @@ msgid ""
"the PD separation scenario, it is recommended to enable this "
"configuration on both prefill and decode nodes simultaneously."
msgstr ""
"`recompute_scheduler_enable: true`:启用重计算调度器。当 decode 节点的键值缓"
"存 (KV Cache) 不足时,请求将被发送到 prefill 节点以重新计算 KV Cache。在 PD "
"分离场景中,建议同时在 prefill 和 decode 节点上启用此配置。"
"`recompute_scheduler_enable: true`:启用重计算调度器。当 decode 节点的键值缓存 (KV Cache) "
"不足时,请求将被发送到 prefill 节点以重新计算 KV Cache。在 PD 分离场景中,建议同时在 prefill 和 decode "
"节点上启用此配置。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:579
msgid ""
@@ -402,8 +379,7 @@ msgid ""
"improved efficiency."
msgstr ""
"`multistream_overlap_shared_expert: true`:当张量并行 (TP) 大小为 1 或 "
"`enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程,以"
"提高效率。"
"`enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程,以提高效率。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:580
msgid ""
@@ -412,9 +388,8 @@ msgid ""
"embedding layer to be greater than 1, which is used to reduce the "
"computational load of each card on the LMHead embedding layer."
msgstr ""
"`lmhead_tensor_parallel_size: 16`:当 decode 节点的张量并行 (TP) 大小为 1 "
"时,此参数允许 LMHead 嵌入层的 TP 大小大于 1,用于减少每张卡在 LMHead 嵌入层"
"上的计算负载。"
"`lmhead_tensor_parallel_size: 16`:当 decode 节点的张量并行 (TP) 大小为 1 时,此参数允许 "
"LMHead 嵌入层的 TP 大小大于 1,用于减少每张卡在 LMHead 嵌入层上的计算负载。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:582
msgid "run server for each node:"
@@ -431,7 +406,10 @@ msgid ""
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
"project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr "在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:653
msgid "Functional Verification"
@@ -466,7 +444,9 @@ msgid ""
"After execution, you can get the result, here is the result of "
"`DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1` for reference "
"only."
msgstr "执行后,您可以获得结果。以下是 `vllm-ascend:0.11.0rc1` 中 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 的结果,仅供参考。"
msgstr ""
"执行后,您可以获得结果。以下是 `vllm-ascend:0.11.0rc1` 中 `DeepSeek-V3.1-w8a8-mtp-QuaRot`"
" 的结果,仅供参考。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:44
msgid "dataset"
@@ -541,7 +521,10 @@ msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr "详情请参考[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
msgstr ""
"详情请参考[使用 AISBench "
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation)。"

#: ../../source/tutorials/models/DeepSeek-V3.1.md:693
msgid "The performance result is:"

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -74,41 +74,56 @@ msgstr "模型权重"
msgid ""
"`GLM-4.5`(BF16 version): [Download model "
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)."
msgstr "`GLM-4.5`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)。"
msgstr ""
"`GLM-4.5`(BF16 "
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)。"

#: ../../source/tutorials/models/GLM4.x.md:22
msgid ""
"`GLM-4.6`(BF16 version): [Download model "
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)."
msgstr "`GLM-4.6`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)。"
msgstr ""
"`GLM-4.6`(BF16 "
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)。"

#: ../../source/tutorials/models/GLM4.x.md:23
msgid ""
"`GLM-4.7`(BF16 version): [Download model "
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)."
msgstr "`GLM-4.7`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)。"
msgstr ""
"`GLM-4.7`(BF16 "
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)。"

#: ../../source/tutorials/models/GLM4.x.md:24
msgid ""
"`GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download "
"model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)."
msgstr "`GLM-4.5-w8a8-with-float-mtp`(带 mtp 的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)。"
msgstr ""
"`GLM-4.5-w8a8-with-float-mtp`(带 mtp "
"的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)。"

#: ../../source/tutorials/models/GLM4.x.md:25
msgid ""
"`GLM-4.6-w8a8`(Quantized version without mtp): [Download model "
"weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because "
"vllm do not support GLM4.6 mtp in October, so we do not provide mtp "
"version. And last month, it supported, you can use the following "
"quantization scheme to add mtp weights to Quantized weights."
msgstr "`GLM-4.6-w8a8`(不带 mtp 的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8)。由于 vllm 在十月份不支持 GLM4.6 的 mtp,因此我们不提供 mtp 版本。上个月已支持,您可以使用以下量化方案将 mtp 权重添加到量化权重中。"
"vllm does not support GLM4.6 mtp in October, we do not provide an mtp "
"version. Last month, it was supported; you can use the following "
"quantization scheme to add mtp weights to the quantized weights."
msgstr ""
"`GLM-4.6-w8a8`(不带 mtp "
"的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8)。由于"
" vllm 在十月份不支持 GLM4.6 的 mtp,因此我们不提供 mtp 版本。上个月已支持,您可以使用以下量化方案将 mtp "
"权重添加到量化权重中。"

#: ../../source/tutorials/models/GLM4.x.md:26
msgid ""
"`GLM-4.7-w8a8-with-float-mtp`(Quantized version without mtp): [Download "
"model weight](https://modelscope.cn/models/Eco-"
"Tech/GLM-4.7-W8A8-floatmtp)."
msgstr "`GLM-4.7-w8a8-with-float-mtp`(不带 mtp 的量化版本):[下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8-floatmtp)。"
msgstr ""
"`GLM-4.7-w8a8-with-float-mtp`(不带 mtp "
"的量化版本):[下载模型权重](https://modelscope.cn/models/Eco-"
"Tech/GLM-4.7-W8A8-floatmtp)。"

#: ../../source/tutorials/models/GLM4.x.md:27
msgid ""
@@ -136,14 +151,17 @@ msgid "A3 series"
msgstr "A3 系列"

#: ../../source/tutorials/models/GLM4.x.md:42
#: ../../source/tutorials/models/GLM4.x.md:85
msgid "Start the docker image on your each node."
msgstr "在您的每个节点上启动 docker 镜像。"
msgid "Start the docker image on each node."
msgstr "在每个节点上启动 docker 镜像。"

#: ../../source/tutorials/models/GLM4.x.md
msgid "A2 series"
msgstr "A2 系列"

#: ../../source/tutorials/models/GLM4.x.md:85
msgid "Start the docker image on your each node."
msgstr "在每个节点上启动 docker 镜像。"

#: ../../source/tutorials/models/GLM4.x.md:118
msgid ""
"In addition, if you don't want to use the docker image as above, you can "
@@ -180,7 +198,12 @@ msgid ""
"The optimization of the FIA operator will be enabled by default in CANN "
"9.x releases, and manual replacement will no longer be required. Please "
"stay tuned for updates to this document."
msgstr "我们已在 CANN 8.5.1 中优化了 FIA 算子。需要手动替换与 FIA 算子相关的文件。请执行 FIA 算子替换脚本:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) 和 [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)。FIA 算子的优化将在 CANN 9.x 版本中默认启用,届时将不再需要手动替换。请关注本文档的更新。"
msgstr ""
"我们已在 CANN 8.5.1 中优化了 FIA 算子。需要手动替换与 FIA 算子相关的文件。请执行 FIA "
"算子替换脚本:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh)"
" 和 "
"[A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)。FIA"
" 算子的优化将在 CANN 9.x 版本中默认启用,届时将不再需要手动替换。请关注本文档的更新。"

#: ../../source/tutorials/models/GLM4.x.md:132
msgid "Single-node Deployment"
@@ -194,144 +217,155 @@ msgstr "在低延迟场景下,我们推荐单机部署。"
msgid ""
"Quantized model `glm4.7_w8a8_with_float_mtp` can be deployed on 1 Atlas "
"800 A3 (64G × 16) or 1 Atlas 800 A2 (64G × 8)."
msgstr "量化模型 `glm4.7_w8a8_with_float_mtp` 可以部署在 1 台 Atlas 800 A3(64G × 16)或 1 台 Atlas 800 A2(64G × 8)上。"
msgstr ""
"量化模型 `glm4.7_w8a8_with_float_mtp` 可以部署在 1 台 Atlas 800 A3(64G × 16)或 1 台 "
"Atlas 800 A2(64G × 8)上。"

#: ../../source/tutorials/models/GLM4.x.md:137
msgid "Run the following script to execute online inference."
msgstr "运行以下脚本以执行在线推理。"

#: ../../source/tutorials/models/GLM4.x.md:169
#: ../../source/tutorials/models/GLM4.x.md:168
msgid "**Notice:** The parameters are explained as follows:"
msgstr "**注意:** 参数解释如下:"

#: ../../source/tutorials/models/GLM4.x.md:172
#: ../../source/tutorials/models/GLM4.x.md:171
msgid ""
"`--async-scheduling` Asynchronous scheduling is a technique used to "
"optimize inference efficiency. It allows non-blocking task scheduling to "
"improve concurrency and throughput, especially when processing large-"
"scale models."
msgstr "`--async-scheduling` 异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,特别是在处理大规模模型时。"
msgstr ""
"`--async-scheduling` "
"异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,特别是在处理大规模模型时。"

#: ../../source/tutorials/models/GLM4.x.md:173
#: ../../source/tutorials/models/GLM4.x.md:172
msgid ""
"`fusion_ops_gmmswigluquant` The performance of the GmmSwigluQuant fusion "
"operator tends to degrade when the total number of NPUs is ≤ 16."
msgstr "`fusion_ops_gmmswigluquant` 当 NPU 总数 ≤ 16 时,GmmSwigluQuant 融合算子的性能往往会下降。"

#: ../../source/tutorials/models/GLM4.x.md:175
#: ../../source/tutorials/models/GLM4.x.md:174
msgid "Multi-node Deployment"
msgstr "多节点部署"

#: ../../source/tutorials/models/GLM4.x.md:177
#: ../../source/tutorials/models/GLM4.x.md:176
msgid ""
"Although the former tutorial said \"Not recommended to deploy multi-node "
"on Atlas 800 A2 (64G × 8)\", but if you insist to deploy GLM-4.x model on"
" multi-node like 2 × Atlas 800 A2 (64G × 8), run the following scripts on"
" two nodes respectively."
msgstr "尽管之前的教程提到“不建议在 Atlas 800 A2(64G × 8)上部署多节点”,但如果您坚持要在类似 2 × Atlas 800 A2(64G × 8)的多节点上部署 GLM-4.x 模型,请分别在两个节点上运行以下脚本。"
msgstr ""
"尽管之前的教程提到“不建议在 Atlas 800 A2(64G × 8)上部署多节点”,但如果您坚持要在类似 2 × Atlas 800 "
"A2(64G × 8)的多节点上部署 GLM-4.x 模型,请分别在两个节点上运行以下脚本。"

#: ../../source/tutorials/models/GLM4.x.md:179
#: ../../source/tutorials/models/GLM4.x.md:178
msgid "**Node 0**"
msgstr "**节点 0**"

#: ../../source/tutorials/models/GLM4.x.md:230
#: ../../source/tutorials/models/GLM4.x.md:228
msgid "**Node 1**"
msgstr "**节点 1**"

#: ../../source/tutorials/models/GLM4.x.md:283
#: ../../source/tutorials/models/GLM4.x.md:280
msgid "Prefill-Decode Disaggregation"
msgstr "Prefill-Decode 解耦部署"

#: ../../source/tutorials/models/GLM4.x.md:285
#: ../../source/tutorials/models/GLM4.x.md:282
msgid ""
"We'd like to show the deployment guide of `GLM4.7` on multi-node "
"environment with 2P1D for better performance."
msgstr "我们将展示 `GLM4.7` 在多节点环境(2P1D)下的部署指南,以获得更好的性能。"

#: ../../source/tutorials/models/GLM4.x.md:287
#: ../../source/tutorials/models/GLM4.x.md:284
msgid "Before you start, please"
msgstr "在开始之前,请"

#: ../../source/tutorials/models/GLM4.x.md:289
#: ../../source/tutorials/models/GLM4.x.md:286
msgid "prepare the script `launch_online_dp.py` on each node:"
msgstr "在每个节点上准备脚本 `launch_online_dp.py`:"

#: ../../source/tutorials/models/GLM4.x.md:392
#: ../../source/tutorials/models/GLM4.x.md:389
msgid "prepare the script `run_dp_template.sh` on each node."
msgstr "在每个节点上准备脚本 `run_dp_template.sh`。"

#: ../../source/tutorials/models/GLM4.x.md:394
#: ../../source/tutorials/models/GLM4.x.md:669
#: ../../source/tutorials/models/GLM4.x.md:391
#: ../../source/tutorials/models/GLM4.x.md:664
msgid "Prefill node 0"
msgstr "Prefill 节点 0"

#: ../../source/tutorials/models/GLM4.x.md:460
#: ../../source/tutorials/models/GLM4.x.md:676
#: ../../source/tutorials/models/GLM4.x.md:456
#: ../../source/tutorials/models/GLM4.x.md:671
msgid "Prefill node 1"
msgstr "Prefill 节点 1"

#: ../../source/tutorials/models/GLM4.x.md:525
#: ../../source/tutorials/models/GLM4.x.md:683
#: ../../source/tutorials/models/GLM4.x.md:520
#: ../../source/tutorials/models/GLM4.x.md:678
msgid "Decode node 0"
msgstr "Decode 节点 0"

#: ../../source/tutorials/models/GLM4.x.md:596
#: ../../source/tutorials/models/GLM4.x.md:690
#: ../../source/tutorials/models/GLM4.x.md:591
#: ../../source/tutorials/models/GLM4.x.md:685
msgid "Decode node 1"
msgstr "Decode 节点 1"

#: ../../source/tutorials/models/GLM4.x.md:667
#: ../../source/tutorials/models/GLM4.x.md:662
msgid ""
"Once the preparation is done, you can start the server with the following"
" command on each node:"
msgstr "准备工作完成后,您可以在每个节点上使用以下命令启动服务器:"

#: ../../source/tutorials/models/GLM4.x.md:697
#: ../../source/tutorials/models/GLM4.x.md:692
msgid "Request Forwarding"
msgstr "请求转发"

#: ../../source/tutorials/models/GLM4.x.md:699
#: ../../source/tutorials/models/GLM4.x.md:694
msgid ""
"To set up request forwarding, run the following script on any machine. "
"You can get the proxy program in the repository's examples: "
"[load_balance_proxy_server_example.py](https://github.com/vllm-project"
"/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr "要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"

#: ../../source/tutorials/models/GLM4.x.md:728
#: ../../source/tutorials/models/GLM4.x.md:723
msgid "Functional Verification"
msgstr "功能验证"

#: ../../source/tutorials/models/GLM4.x.md:730
#: ../../source/tutorials/models/GLM4.x.md:725
msgid "Once your server is started, you can query the model with input prompts:"
msgstr "服务器启动后,您可以使用输入提示词查询模型:"

#: ../../source/tutorials/models/GLM4.x.md:749
#: ../../source/tutorials/models/GLM4.x.md:744
msgid "Accuracy Evaluation"
msgstr "精度评估"

#: ../../source/tutorials/models/GLM4.x.md:751
#: ../../source/tutorials/models/GLM4.x.md:746
msgid "Here are two accuracy evaluation methods."
msgstr "这里有两种精度评估方法。"

#: ../../source/tutorials/models/GLM4.x.md:753
#: ../../source/tutorials/models/GLM4.x.md:770
#: ../../source/tutorials/models/GLM4.x.md:748
#: ../../source/tutorials/models/GLM4.x.md:765
msgid "Using AISBench"
msgstr "使用 AISBench"

#: ../../source/tutorials/models/GLM4.x.md:755
#: ../../source/tutorials/models/GLM4.x.md:750
msgid ""
"Refer to [Using "
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
"details."
msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"

#: ../../source/tutorials/models/GLM4.x.md:757
#: ../../source/tutorials/models/GLM4.x.md:752
msgid ""
"After execution, you can get the result, here is the result of `GLM4.7` "
"in `vllm-ascend:main` (after `vllm-ascend:0.14.0rc1`) for reference only."
msgstr "执行后,您可以获得结果,以下是 `GLM4.7` 在 `vllm-ascend:main`(`vllm-ascend:0.14.0rc1` 之后)中的结果,仅供参考。"
msgstr ""
"执行后,您可以获得结果,以下是 `GLM4.7` 在 `vllm-ascend:main`(`vllm-ascend:0.14.0rc1` "
"之后)中的结果,仅供参考。"

#: ../../source/tutorials/models/GLM4.x.md:87
msgid "dataset"
@@ -389,111 +423,111 @@ msgstr "MATH500"
msgid "98.8"
msgstr "98.8"

#: ../../source/tutorials/models/GLM4.x.md:764
#: ../../source/tutorials/models/GLM4.x.md:759
msgid "Using Language Model Evaluation Harness"
msgstr "使用语言模型评估工具"

#: ../../source/tutorials/models/GLM4.x.md:766
#: ../../source/tutorials/models/GLM4.x.md:761
msgid "Not tested yet."
msgstr "尚未测试。"

#: ../../source/tutorials/models/GLM4.x.md:768
#: ../../source/tutorials/models/GLM4.x.md:763
msgid "Performance"
msgstr "性能"

#: ../../source/tutorials/models/GLM4.x.md:772
#: ../../source/tutorials/models/GLM4.x.md:767
msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr ""
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md"
"#execute-performance-evaluation)。"

#: ../../source/tutorials/models/GLM4.x.md:774
#: ../../source/tutorials/models/GLM4.x.md:769
msgid "Using vLLM Benchmark"
msgstr "使用vLLM基准测试"

#: ../../source/tutorials/models/GLM4.x.md:776
#: ../../source/tutorials/models/GLM4.x.md:771
msgid "Run performance evaluation of `GLM-4.x` as an example."
msgstr "以运行 `GLM-4.x` 的性能评估为例。"

#: ../../source/tutorials/models/GLM4.x.md:778
#: ../../source/tutorials/models/GLM4.x.md:773
msgid ""
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
"for more details."
msgstr ""
"更多详情请参考 [vllm基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
msgstr "更多详情请参考 [vllm基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"

#: ../../source/tutorials/models/GLM4.x.md:780
#: ../../source/tutorials/models/GLM4.x.md:775
msgid "There are three `vllm bench` subcommands:"
msgstr "`vllm bench` 包含三个子命令:"

#: ../../source/tutorials/models/GLM4.x.md:782
#: ../../source/tutorials/models/GLM4.x.md:777
msgid "`latency`: Benchmark the latency of a single batch of requests."
msgstr "`latency`:基准测试单批次请求的延迟。"

#: ../../source/tutorials/models/GLM4.x.md:783
#: ../../source/tutorials/models/GLM4.x.md:778
msgid "`serve`: Benchmark the online serving throughput."
msgstr "`serve`:基准测试在线服务吞吐量。"

#: ../../source/tutorials/models/GLM4.x.md:784
#: ../../source/tutorials/models/GLM4.x.md:779
msgid "`throughput`: Benchmark offline inference throughput."
msgstr "`throughput`:基准测试离线推理吞吐量。"

#: ../../source/tutorials/models/GLM4.x.md:786
#: ../../source/tutorials/models/GLM4.x.md:781
msgid "Take the `serve` as an example. Run the code as follows."
msgstr "以 `serve` 为例,运行以下代码。"

#: ../../source/tutorials/models/GLM4.x.md:808
#: ../../source/tutorials/models/GLM4.x.md:803
msgid ""
"After about several minutes, you can get the performance evaluation "
"result."
msgstr "大约几分钟后,您将获得性能评估结果。"

#: ../../source/tutorials/models/GLM4.x.md:810
#: ../../source/tutorials/models/GLM4.x.md:805
msgid "Best Practices"
msgstr "最佳实践"

#: ../../source/tutorials/models/GLM4.x.md:812
#: ../../source/tutorials/models/GLM4.x.md:807
msgid "In this chapter, we recommend best practices for three scenarios:"
msgstr "本章节,我们针对三种场景推荐最佳实践:"

#: ../../source/tutorials/models/GLM4.x.md:814
#: ../../source/tutorials/models/GLM4.x.md:809
msgid ""
"Long-context: For long sequences with low concurrency (≤ 4): set `dp1 "
"tp16`; For long sequences with high concurrency (> 4): set `dp2 tp8`"
msgstr ""
"长上下文:对于低并发(≤ 4)的长序列,设置 `dp1 tp16`;对于高并发(> 4)的长序列,设置 `dp2 tp8`"
msgstr "长上下文:对于低并发(≤ 4)的长序列,设置 `dp1 tp16`;对于高并发(> 4)的长序列,设置 `dp2 tp8`"

#: ../../source/tutorials/models/GLM4.x.md:815
#: ../../source/tutorials/models/GLM4.x.md:810
msgid ""
"Low-latency: For short sequences with low latency: we recommend setting "
"`dp2 tp8`"
msgstr "低延迟:对于需要低延迟的短序列,我们推荐设置 `dp2 tp8`"

#: ../../source/tutorials/models/GLM4.x.md:816
#: ../../source/tutorials/models/GLM4.x.md:811
msgid ""
"High-throughput: For short sequences with high throughput: we also "
"recommend setting `dp2 tp8`"
msgstr "高吞吐量:对于需要高吞吐量的短序列,我们也推荐设置 `dp2 tp8`"

#: ../../source/tutorials/models/GLM4.x.md:818
#: ../../source/tutorials/models/GLM4.x.md:813
msgid ""
"**Notice:** `max-model-len` and `max-num-seqs` need to be set according "
"to the actual usage scenario. For other settings, please refer to the "
"**[Deployment](#deployment)** chapter."
msgstr ""
"**注意:** `max-model-len` 和 `max-num-seqs` 需要根据实际使用场景进行设置。其他设置请参考 **[部署](#deployment)** 章节。"
"**注意:** `max-model-len` 和 `max-num-seqs` 需要根据实际使用场景进行设置。其他设置请参考 "
"**[部署](#deployment)** 章节。"

#: ../../source/tutorials/models/GLM4.x.md:821
#: ../../source/tutorials/models/GLM4.x.md:816
msgid "FAQ"
msgstr "常见问题"

#: ../../source/tutorials/models/GLM4.x.md:823
#: ../../source/tutorials/models/GLM4.x.md:818
msgid "**Q: Why is the TPOT performance poor in Long-context test?**"
msgstr "**问:为什么在长上下文测试中TPOT性能不佳?**"

#: ../../source/tutorials/models/GLM4.x.md:825
#: ../../source/tutorials/models/GLM4.x.md:820
msgid ""
"A: Please ensure that the FIA operator replacement script has been "
"executed successfully to complete the replacement of FIA operators. Here "
@@ -501,28 +535,28 @@ msgid ""
"[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) and"
" [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"
msgstr ""
"答:请确保已成功执行FIA算子替换脚本以完成FIA算子的替换。脚本如下:"
"[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) 和 "
"[A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"
"答:请确保已成功执行FIA算子替换脚本以完成FIA算子的替换。脚本如下:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh)"
" 和 [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"

#: ../../source/tutorials/models/GLM4.x.md:827
#: ../../source/tutorials/models/GLM4.x.md:822
msgid ""
"**Q: Startup fails with HCCL port conflicts (address already bound). What"
" should I do?**"
msgstr "**问:启动失败,提示HCCL端口冲突(地址已被占用)。我该怎么办?**"

#: ../../source/tutorials/models/GLM4.x.md:829
#: ../../source/tutorials/models/GLM4.x.md:824
msgid "A: Clean up old processes and restart: `pkill -f VLLM*`."
msgstr "答:清理旧进程并重启:`pkill -f VLLM*`。"

#: ../../source/tutorials/models/GLM4.x.md:831
#: ../../source/tutorials/models/GLM4.x.md:826
msgid "**Q: How to handle OOM or unstable startup?**"
msgstr "**问:如何处理OOM或启动不稳定的问题?**"

#: ../../source/tutorials/models/GLM4.x.md:833
#: ../../source/tutorials/models/GLM4.x.md:828
msgid ""
"A: Reduce `--max-num-seqs` and `--max-model-len` first. If needed, reduce"
" concurrency and load-testing pressure (e.g., `max-concurrency` / `num-"
"prompts`)."
msgstr ""
"答:首先减少 `--max-num-seqs` 和 `--max-model-len`。如有需要,降低并发度和负载测试压力(例如,`max-concurrency` / `num-prompts`)。"
"答:首先减少 `--max-num-seqs` 和 `--max-model-len`。如有需要,降低并发度和负载测试压力(例如,`max-"
"concurrency` / `num-prompts`)。"

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -30,10 +30,11 @@ msgstr "简介"
#: ../../source/tutorials/models/GLM5.md:5
msgid ""
"[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts "
"(MoE) architecture and targeting at complex systems engineering and long-"
"(MoE) architecture and targets at complex systems engineering and long-"
"horizon agentic tasks."
msgstr ""
"[GLM-5](https://huggingface.co/zai-org/GLM-5) 采用混合专家 (Mixture-of-Experts, MoE) 架构,旨在处理复杂系统工程和长视野智能体任务。"
"[GLM-5](https://huggingface.co/zai-org/GLM-5) 采用混合专家 (Mixture-of-Experts,"
" MoE) 架构,旨在处理复杂系统工程和长视野智能体任务。"

#: ../../source/tutorials/models/GLM5.md:7
msgid ""
@@ -41,7 +42,8 @@ msgid ""
"`vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1` , the version of "
"transformers need to be upgraded to 5.2.0."
msgstr ""
"`GLM-5` 模型首次在 `vllm-ascend:v0.17.0rc1` 版本中得到支持。在 `vllm-ascend:v0.17.0rc1` 和 `vllm-ascend:v0.18.0rc1` 版本中,需要将 transformers 的版本升级到 5.2.0。"
"`GLM-5` 模型首次在 `vllm-ascend:v0.17.0rc1` 版本中得到支持。在 `vllm-ascend:v0.17.0rc1`"
" 和 `vllm-ascend:v0.18.0rc1` 版本中,需要将 transformers 的版本升级到 5.2.0。"

#: ../../source/tutorials/models/GLM5.md:9
msgid ""
@@ -49,8 +51,7 @@ msgid ""
"including supported features, feature configuration, environment "
"preparation, single-node and multi-node deployment, accuracy and "
"performance evaluation."
msgstr ""
"本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"
msgstr "本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"

#: ../../source/tutorials/models/GLM5.md:11
msgid "Supported Features"
@@ -61,15 +62,13 @@ msgid ""
"Refer to [supported "
"features](../../user_guide/support_matrix/supported_models.md) to get the"
" model's supported feature matrix."
msgstr ""
"请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
msgstr "请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"

#: ../../source/tutorials/models/GLM5.md:15
msgid ""
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
"the feature's configuration."
msgstr ""
"请参考[特性指南](../../user_guide/feature_guide/index.md)以获取特性的配置方法。"
msgstr "请参考[特性指南](../../user_guide/feature_guide/index.md)以获取特性的配置方法。"

#: ../../source/tutorials/models/GLM5.md:17
msgid "Environment Preparation"
@@ -84,35 +83,34 @@ msgid ""
"`GLM-5`(BF16 version): [Download model "
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5)."
msgstr ""
"`GLM-5` (BF16 版本): [下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-5)。"
"`GLM-5` (BF16 版本): "
"[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-5)。"

#: ../../source/tutorials/models/GLM5.md:22
msgid ""
"`GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-"
"Tech/GLM-5-w4a8)."
msgstr ""
"`GLM-5-w4a8`: [下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8)。"
msgstr "`GLM-5-w4a8`: [下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8)。"

#: ../../source/tutorials/models/GLM5.md:23
msgid ""
"`GLM-5-w8a8`: [Download model weight](https://www.modelscope.cn/models"
"/Eco-Tech/GLM-5-w8a8)."
msgstr ""
"`GLM-5-w8a8`: [下载模型权重](https://www.modelscope.cn/models/Eco-Tech/GLM-5-w8a8)。"
"`GLM-5-w8a8`: [下载模型权重](https://www.modelscope.cn/models/Eco-"
"Tech/GLM-5-w8a8)。"

#: ../../source/tutorials/models/GLM5.md:24
msgid ""
"You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to "
"quantify the model naively."
msgstr ""
"您可以使用 [msmodelslim](https://gitcode.com/Ascend/msmodelslim) 对模型进行简单的量化。"
msgstr "您可以使用 [msmodelslim](https://gitcode.com/Ascend/msmodelslim) 对模型进行简单的量化。"

#: ../../source/tutorials/models/GLM5.md:26
msgid ""
"It is recommended to download the model weight to the shared directory of"
" multiple nodes, such as `/root/.cache/`"
msgstr ""
"建议将模型权重下载到多个节点的共享目录中,例如 `/root/.cache/`"
msgstr "建议将模型权重下载到多个节点的共享目录中,例如 `/root/.cache/`"

#: ../../source/tutorials/models/GLM5.md:28
msgid "Installation"
@@ -146,7 +144,8 @@ msgid ""
"Install `vllm-ascend` from source, refer to "
"[installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)."
msgstr ""
"从源码安装 `vllm-ascend`,请参考[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)。"
"从源码安装 `vllm-"
"ascend`,请参考[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)。"

#: ../../source/tutorials/models/GLM5.md:123
msgid ""
@@ -200,7 +199,9 @@ msgid ""
"optimize inference efficiency. It allows non-blocking task scheduling to "
"improve concurrency and throughput, especially when processing large-"
"scale models."
msgstr "`--async-scheduling` 异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,尤其是在处理大规模模型时。"
msgstr ""
"`--async-scheduling` "
"异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,尤其是在处理大规模模型时。"

#: ../../source/tutorials/models/GLM5.md:254
msgid "Multi-node Deployment"
@@ -211,7 +212,9 @@ msgid ""
"If you want to deploy multi-node environment, you need to verify multi-"
"node communication according to [verify multi-node communication "
"environment](../../installation.md#verify-multi-node-communication)."
msgstr "如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
msgstr ""
"如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-"
"communication)来验证多节点通信。"

#: ../../source/tutorials/models/GLM5.md:265
msgid "`glm-5-bf16`: require at least 2 Atlas 800 A3 (64G × 16)."
@@ -240,7 +243,9 @@ msgid ""
"For bf16 weight, use this script on each node to enable [Multi Token "
"Prediction "
"(MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)."
msgstr "对于 bf16 权重,在每个节点上使用此脚本来启用[多令牌预测 (MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)。"
msgstr ""
"对于 bf16 权重,在每个节点上使用此脚本来启用[多令牌预测 "
"(MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)。"

#: ../../source/tutorials/models/GLM5.md:526
msgid "`glm-5-w8a8`: require 2 Atlas 800 A3 (64G × 16)."
@@ -276,200 +281,221 @@ msgid ""
"deployment, `layer_sharding` is supported only on prefill/P nodes with "
"`kv_role=\"kv_producer\"`; do not enable it on decode/D nodes or "
"`kv_role=\"kv_both\"` nodes."
msgstr "为了在预填充阶段支持 200k 的上下文窗口,需要在每个预填充节点的 `--additional_config` 中添加参数 `\"layer_sharding\": [\"q_b_proj\"]`。在 PD 解耦部署中,`layer_sharding` 仅在 `kv_role=\"kv_producer\"` 的预填充/P 节点上受支持;不要在解码/D 节点或 `kv_role=\"kv_both\"` 的节点上启用它。"
msgstr ""
"为了在预填充阶段支持 200k 的上下文窗口,需要在每个预填充节点的 `--additional_config` 中添加参数 "
"`\"layer_sharding\": [\"q_b_proj\"]`。在 PD 解耦部署中,`layer_sharding` 仅在 "
"`kv_role=\"kv_producer\"` 的预填充/P 节点上受支持;不要在解码/D 节点或 `kv_role=\"kv_both\"`"
" 的节点上启用它。"

#: ../../source/tutorials/models/GLM5.md:747
#: ../../source/tutorials/models/GLM5.md:1233
#: ../../source/tutorials/models/GLM5.md:1231
msgid "Prefill node 0"
msgstr "预填充节点 0"

#: ../../source/tutorials/models/GLM5.md:826
#: ../../source/tutorials/models/GLM5.md:1240
#: ../../source/tutorials/models/GLM5.md:825
#: ../../source/tutorials/models/GLM5.md:1238
msgid "Prefill node 1"
msgstr "预填充节点 1"

#: ../../source/tutorials/models/GLM5.md:906
#: ../../source/tutorials/models/GLM5.md:1247
#: ../../source/tutorials/models/GLM5.md:904
#: ../../source/tutorials/models/GLM5.md:1245
msgid "Decode node 0"
msgstr "解码节点 0"

#: ../../source/tutorials/models/GLM5.md:988
#: ../../source/tutorials/models/GLM5.md:1254
#: ../../source/tutorials/models/GLM5.md:986
#: ../../source/tutorials/models/GLM5.md:1252
msgid "Decode node 1"
msgstr "解码节点 1"

#: ../../source/tutorials/models/GLM5.md:1069
#: ../../source/tutorials/models/GLM5.md:1261
||||
#: ../../source/tutorials/models/GLM5.md:1261
|
||||
#: ../../source/tutorials/models/GLM5.md:1067
|
||||
#: ../../source/tutorials/models/GLM5.md:1259
|
||||
msgid "Decode node 2"
|
||||
msgstr "解码节点 2"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1150
|
||||
#: ../../source/tutorials/models/GLM5.md:1268
|
||||
#: ../../source/tutorials/models/GLM5.md:1148
|
||||
#: ../../source/tutorials/models/GLM5.md:1266
|
||||
msgid "Decode node 3"
|
||||
msgstr "解码节点 3"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1231
|
||||
#: ../../source/tutorials/models/GLM5.md:1229
|
||||
msgid ""
|
||||
"Once the preparation is done, you can start the server with the following"
|
||||
" command on each node:"
|
||||
msgstr "准备工作完成后,您可以在每个节点上使用以下命令启动服务器:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1275
|
||||
#: ../../source/tutorials/models/GLM5.md:1273
|
||||
msgid "Request Forwarding"
|
||||
msgstr "请求转发"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1277
|
||||
#: ../../source/tutorials/models/GLM5.md:1275
|
||||
msgid ""
|
||||
"To set up request forwarding, run the following script on any machine. "
|
||||
"You can get the proxy program in the repository's examples: "
|
||||
"[load_balance_proxy_server_example.py](https://github.com/vllm-project"
|
||||
"/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr "要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr ""
|
||||
"要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
|
||||
"/vllm-project/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1318
|
||||
#: ../../source/tutorials/models/GLM5.md:1316
|
||||
msgid "**Notice:**"
|
||||
msgstr "**注意:**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1320
|
||||
#: ../../source/tutorials/models/GLM5.md:1318
|
||||
msgid "Some configurations for optimization are shown below:"
|
||||
msgstr "以下是一些用于优化的配置:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1322
|
||||
#: ../../source/tutorials/models/GLM5.md:1320
|
||||
msgid ""
|
||||
"`VLLM_ASCEND_ENABLE_FLASHCOMM1`: Enable FlashComm optimization to reduce "
|
||||
"communication and computation overhead on prefill node. With FlashComm "
|
||||
"enabled, layer_sharding list cannot include o_proj as an element."
|
||||
msgstr "`VLLM_ASCEND_ENABLE_FLASHCOMM1`: 启用 FlashComm 优化以减少预填充节点上的通信和计算开销。启用 FlashComm 后,layer_sharding 列表不能包含 o_proj 作为元素。"
|
||||
msgstr ""
|
||||
"`VLLM_ASCEND_ENABLE_FLASHCOMM1`: 启用 FlashComm 优化以减少预填充节点上的通信和计算开销。启用 "
|
||||
"FlashComm 后,layer_sharding 列表不能包含 o_proj 作为元素。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1323
|
||||
#: ../../source/tutorials/models/GLM5.md:1321
|
||||
msgid ""
|
||||
"`VLLM_ASCEND_ENABLE_FUSED_MC2`: Enable following fused operators: "
|
||||
"dispatch_gmm_combine_decode and dispatch_ffn_combine operator."
|
||||
msgstr "`VLLM_ASCEND_ENABLE_FUSED_MC2`: 启用以下融合算子:dispatch_gmm_combine_decode 和 dispatch_ffn_combine 算子。"
|
||||
"dispatch_gmm_combine_decode and dispatch_ffn_combine operator. and please"
|
||||
" **note** that this environment variable can only be enabled on decode "
|
||||
"nodes."
|
||||
msgstr ""
|
||||
"`VLLM_ASCEND_ENABLE_FUSED_MC2`: 启用以下融合算子:dispatch_gmm_combine_decode 和 "
|
||||
"dispatch_ffn_combine 算子。并请**注意**,此环境变量仅可在解码节点上启用。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1324
|
||||
#: ../../source/tutorials/models/GLM5.md:1322
|
||||
msgid "`VLLM_ASCEND_ENABLE_MLAPO`: Enable fused operator MlaPreprocessOperation."
|
||||
msgstr "`VLLM_ASCEND_ENABLE_MLAPO`: 启用融合算子 MlaPreprocessOperation。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1326
|
||||
#: ../../source/tutorials/models/GLM5.md:1324
|
||||
msgid ""
|
||||
"Please refer to the following python file for further explanation and "
|
||||
"restrictions of the environment variables above: "
|
||||
"[envs.py](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/main/vllm_ascend/envs.py)"
|
||||
msgstr "有关上述环境变量的进一步解释和限制,请参考以下 python 文件:[envs.py](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/envs.py)"
|
||||
msgstr ""
|
||||
"有关上述环境变量的进一步解释和限制,请参考以下 python 文件:[envs.py](https://github.com/vllm-"
|
||||
"project/vllm-ascend/blob/main/vllm_ascend/envs.py)"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1328
|
||||
#: ../../source/tutorials/models/GLM5.md:1326
|
||||
msgid "Functional Verification"
|
||||
msgstr "功能验证"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1330
|
||||
#: ../../source/tutorials/models/GLM5.md:1328
|
||||
msgid "Once your server is started, you can query the model with input prompts:"
|
||||
msgstr "服务器启动后,您可以使用输入提示词查询模型:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1343
|
||||
#: ../../source/tutorials/models/GLM5.md:1341
|
||||
msgid "Accuracy Evaluation"
|
||||
msgstr "精度评估"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1345
|
||||
#: ../../source/tutorials/models/GLM5.md:1343
|
||||
msgid "Here are two accuracy evaluation methods."
|
||||
msgstr "以下是两种精度评估方法。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1347
|
||||
#: ../../source/tutorials/models/GLM5.md:1359
|
||||
#: ../../source/tutorials/models/GLM5.md:1345
|
||||
#: ../../source/tutorials/models/GLM5.md:1357
|
||||
msgid "Using AISBench"
|
||||
msgstr "使用AISBench"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1349
|
||||
#: ../../source/tutorials/models/GLM5.md:1347
|
||||
msgid ""
|
||||
"Refer to [Using "
|
||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||
"details."
|
||||
msgstr "详情请参考[使用AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1351
|
||||
#: ../../source/tutorials/models/GLM5.md:1349
|
||||
msgid "After execution, you can get the result."
|
||||
msgstr "执行后,您将获得结果。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1353
|
||||
#: ../../source/tutorials/models/GLM5.md:1351
|
||||
msgid "Using Language Model Evaluation Harness"
|
||||
msgstr "使用Language Model Evaluation Harness"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1355
|
||||
#: ../../source/tutorials/models/GLM5.md:1353
|
||||
msgid "Not tested yet."
|
||||
msgstr "尚未测试。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1357
|
||||
#: ../../source/tutorials/models/GLM5.md:1355
|
||||
msgid "Performance"
|
||||
msgstr "性能"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1361
|
||||
#: ../../source/tutorials/models/GLM5.md:1359
|
||||
msgid ""
|
||||
"Refer to [Using AISBench for performance "
|
||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation) for details."
|
||||
msgstr "详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
||||
msgstr ""
|
||||
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md"
|
||||
"#execute-performance-evaluation)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1363
|
||||
#: ../../source/tutorials/models/GLM5.md:1361
|
||||
msgid "Using vLLM Benchmark"
|
||||
msgstr "使用vLLM基准测试"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1365
|
||||
#: ../../source/tutorials/models/GLM5.md:1363
|
||||
msgid ""
|
||||
"Refer to [vllm "
|
||||
"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
|
||||
"for more details."
|
||||
msgstr "更多详情请参考[vllm基准测试](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1367
|
||||
#: ../../source/tutorials/models/GLM5.md:1365
|
||||
msgid "Best Practices"
|
||||
msgstr "最佳实践"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1369
|
||||
#: ../../source/tutorials/models/GLM5.md:1367
|
||||
msgid ""
|
||||
"In this chapter, we recommend best practices in prefill-decode "
|
||||
"disaggregation scenario with 1P1D architecture using 4 Atlas 800 A3 (64G "
|
||||
"× 16):"
|
||||
msgstr "本章节,我们推荐在使用4台Atlas 800 A3(64G × 16)的1P1D架构下,预填充-解码分离场景的最佳实践:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1371
|
||||
#: ../../source/tutorials/models/GLM5.md:1369
|
||||
msgid ""
|
||||
"Low-latency: We recommend setting `dp4 tp8` on prefill nodes and `dp4 "
|
||||
"tp8` on decode nodes for low latency situation."
|
||||
msgstr "低延迟场景:对于低延迟场景,我们建议在预填充节点上设置`dp4 tp8`,在解码节点上设置`dp4 tp8`。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1372
|
||||
#: ../../source/tutorials/models/GLM5.md:1370
|
||||
msgid ""
|
||||
"High-throughput: `dp4 tp8` on prefill nodes and `dp8 tp4` on decode nodes"
|
||||
" is recommended for high throughput situation."
|
||||
msgstr "高吞吐场景:对于高吞吐场景,建议在预填充节点上设置`dp4 tp8`,在解码节点上设置`dp8 tp4`。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1374
|
||||
#: ../../source/tutorials/models/GLM5.md:1372
|
||||
msgid ""
|
||||
"**Notice:** `max-model-len` and `max-num-seqs` need to be set according "
|
||||
"to the actual usage scenario. For other settings, please refer to the "
|
||||
"**[Deployment](#deployment)** chapter."
|
||||
msgstr "**注意:** `max-model-len`和`max-num-seqs`需要根据实际使用场景进行设置。其他设置请参考**[部署](#deployment)**章节。"
|
||||
msgstr ""
|
||||
"**注意:** `max-model-len`和`max-num-"
|
||||
"seqs`需要根据实际使用场景进行设置。其他设置请参考**[部署](#deployment)**章节。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1377
|
||||
#: ../../source/tutorials/models/GLM5.md:1375
|
||||
msgid "FAQ"
|
||||
msgstr "常见问题"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1379
|
||||
#: ../../source/tutorials/models/GLM5.md:1377
|
||||
msgid ""
|
||||
"**Q: How to solve ValueError: Tokenizer class TokenizersBackend does not "
|
||||
"exist or is not currently imported?**"
|
||||
msgstr "**问:如何解决ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported?**"
|
||||
msgstr ""
|
||||
"**问:如何解决ValueError: Tokenizer class TokenizersBackend does not exist or "
|
||||
"is not currently imported?**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1381
|
||||
#: ../../source/tutorials/models/GLM5.md:1379
|
||||
msgid "A: Please update the version of transformers to 5.2.0"
|
||||
msgstr "答:请将transformers版本更新至5.2.0"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1383
|
||||
#: ../../source/tutorials/models/GLM5.md:1381
|
||||
msgid "**Q: How to enable function calling for GLM-5?**"
|
||||
msgstr "**问:如何为GLM-5启用函数调用功能?**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:1385
|
||||
#: ../../source/tutorials/models/GLM5.md:1383
|
||||
msgid "A: Please add following configurations in vLLM startup command"
|
||||
msgstr "答:请在vLLM启动命令中添加以下配置"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -35,7 +35,9 @@ msgid ""
"resolution visual encoder with the ERNIE-4.5-0.3B language model to "
"enable accurate element recognition."
msgstr ""
"PaddleOCR-VL 是一款专为文档解析设计的 SOTA 且资源高效的模型。其核心组件是 PaddleOCR-VL-0.9B,一个紧凑而强大的视觉语言模型(VLM),它集成了 NaViT 风格的动态分辨率视觉编码器和 ERNIE-4.5-0.3B 语言模型,以实现精确的元素识别。"
"PaddleOCR-VL 是一款专为文档解析设计的 SOTA 且资源高效的模型。其核心组件是 PaddleOCR-"
"VL-0.9B,一个紧凑而强大的视觉语言模型(VLM),它集成了 NaViT 风格的动态分辨率视觉编码器和 ERNIE-4.5-0.3B "
"语言模型,以实现精确的元素识别。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:7
msgid ""
@@ -44,8 +46,7 @@ msgid ""
"preparation, single-node deployment, and functional verification. It is "
"designed to help users quickly complete model deployment and "
"verification."
msgstr ""
"本文档提供了完整的模型部署和验证的详细工作流程,包括支持的特性、环境准备、单节点部署和功能验证。旨在帮助用户快速完成模型部署和验证。"
msgstr "本文档提供了完整的模型部署和验证的详细工作流程,包括支持的特性、环境准备、单节点部署和功能验证。旨在帮助用户快速完成模型部署和验证。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:9
msgid "Supported Features"
@@ -56,8 +57,7 @@ msgid ""
"Refer to [supported "
"features](../../user_guide/support_matrix/supported_models.md) to get the"
" model's supported feature matrix."
msgstr ""
"请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
msgstr "请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:13
msgid ""
@@ -78,7 +78,8 @@ msgid ""
"`PaddleOCR-VL-0.9B`: [PaddleOCR-"
"VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"
msgstr ""
"`PaddleOCR-VL-0.9B`: [PaddleOCR-VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"
"`PaddleOCR-VL-0.9B`: [PaddleOCR-"
"VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"

#: ../../source/tutorials/models/PaddleOCR-VL.md:21
msgid ""
@@ -99,13 +100,15 @@ msgid ""
"Select an image based on your machine type and start the docker image on "
"your node, refer to [using docker](../../installation.md#set-up-using-"
"docker)."
msgstr "根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-docker)。"
msgstr ""
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-"
"up-using-docker)。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:51
msgid ""
"The 310P device is supported from version 0.15.0rc1. You need to select "
"the corresponding image for installation."
msgstr "310P 设备从版本 0.15.0rc1 开始支持。您需要选择对应的镜像进行安装。"
"The Atlas 300 inference products are supported from version 0.15.0rc1. "
"You need to select the corresponding image for installation."
msgstr "Atlas 300 推理产品从版本 0.15.0rc1 开始支持。您需要选择对应的镜像进行安装。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:54
msgid "Deployment"
@@ -122,8 +125,9 @@ msgstr "单 NPU (PaddleOCR-VL)"
#: ../../source/tutorials/models/PaddleOCR-VL.md:60
msgid ""
"PaddleOCR-VL supports single-node single-card deployment on the 910B4 and"
" 310P platform. Follow these steps to start the inference service:"
msgstr "PaddleOCR-VL 支持在 910B4 和 310P 平台上进行单节点单卡部署。请按照以下步骤启动推理服务:"
" Atlas 300 inference products platform. Follow these steps to start the "
"inference service:"
msgstr "PaddleOCR-VL 支持在 910B4 和 Atlas 300 推理产品平台上进行单节点单卡部署。请按照以下步骤启动推理服务:"

#: ../../source/tutorials/models/PaddleOCR-VL.md:62
msgid ""
@@ -144,18 +148,20 @@ msgid "Run the following script to start the vLLM server on single 910B4:"
msgstr "运行以下脚本在单张 910B4 上启动 vLLM 服务器:"

#: ../../source/tutorials/models/PaddleOCR-VL.md
msgid "310P"
msgstr "310P"
msgid "Atlas 300 inference products"
msgstr "Atlas 300 推理产品"

#: ../../source/tutorials/models/PaddleOCR-VL.md:97
msgid "Run the following script to start the vLLM server on single 310P:"
msgstr "运行以下脚本在单张 310P 上启动 vLLM 服务器:"
msgid ""
"Run the following script to start the vLLM server on single Atlas 300 "
"inference products:"
msgstr "运行以下脚本在单张 Atlas 300 推理产品上启动 vLLM 服务器:"

#: ../../source/tutorials/models/PaddleOCR-VL.md:116
msgid ""
"The `--max_model_len` option is added to prevent errors when generating "
"the attention operator mask on the 310P device."
msgstr "添加 `--max_model_len` 选项是为了防止在 310P 设备上生成注意力算子掩码时出错。"
"the attention operator mask on the Atlas 300 inference products."
msgstr "添加 `--max_model_len` 选项是为了防止在 Atlas 300 推理产品上生成注意力算子掩码时出错。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:121
msgid "Multiple NPU (PaddleOCR-VL)"
@@ -204,7 +210,9 @@ msgid ""
"DocLayoutV2 model to fully unleash the capabilities of the PaddleOCR-VL "
"model, making it more consistent with the examples provided by the "
"official PaddlePaddle documentation."
msgstr "在上面的示例中,我们演示了如何使用 vLLM 推理 PaddleOCR-VL-0.9B 模型。通常,我们还需要集成 PP-DocLayoutV2 模型,以充分发挥 PaddleOCR-VL 模型的能力,使其更符合官方 PaddlePaddle 文档提供的示例。"
msgstr ""
"在上面的示例中,我们演示了如何使用 vLLM 推理 PaddleOCR-VL-0.9B 模型。通常,我们还需要集成 PP-DocLayoutV2 "
"模型,以充分发挥 PaddleOCR-VL 模型的能力,使其更符合官方 PaddlePaddle 文档提供的示例。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:205
msgid ""
@@ -230,11 +238,13 @@ msgstr "使用以下命令启动容器:"

#: ../../source/tutorials/models/PaddleOCR-VL.md:235
msgid ""
"Install "
"Install "
"[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined)"
" and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
" and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
msgstr ""
"安装 [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) 和 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
"安装 "
"[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined)"
" 和 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"

#: ../../source/tutorials/models/PaddleOCR-VL.md:246
msgid "The OpenCV component may be missing:"
@@ -252,11 +262,14 @@ msgstr "OM 推理"

#: ../../source/tutorials/models/PaddleOCR-VL.md:264
msgid ""
"The 310P device supports only the OM model inference. For details about "
"the process, see the guide provided in "
"The Atlas 300 inference products support only the OM model inference. For"
" details about the process, see the guide provided in "
"[ModelZoo](https://gitcode.com/Ascend/ModelZoo-"
"PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2)."
msgstr "310P 设备仅支持 OM 模型推理。有关该过程的详细信息,请参阅 [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2) 中提供的指南。"
msgstr ""
"Atlas 300 推理产品仅支持 OM 模型推理。有关该过程的详细信息,请参阅 [ModelZoo](https://gitcode.com/Ascend"
"/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2) "
"中提供的指南。"

#: ../../source/tutorials/models/PaddleOCR-VL.md:268
msgid ""

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -51,7 +51,8 @@ msgid ""
"demonstration, showcasing the `Qwen3-VL-8B-Instruct` model as an example "
"for single NPU deployment and the `Qwen2.5-VL-32B-Instruct` model as an "
"example for multi-NPU deployment."
msgstr "本教程使用 vLLM-Ascend `v0.11.0rc3-a3` 版本进行演示,以 `Qwen3-VL-8B-Instruct` 模型为例展示单NPU部署,以 `Qwen2.5-VL-32B-Instruct` 模型为例展示多NPU部署。"
msgstr ""
"本教程使用 vLLM-Ascend `v0.11.0rc3-a3` 版本进行演示,以 `Qwen3-VL-8B-Instruct` 模型为例展示单NPU部署,以 `Qwen2.5-VL-32B-Instruct` 模型为例展示多NPU部署。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:11
msgid "Supported Features"
@@ -86,56 +87,65 @@ msgstr "需要 1 个 Atlas 800I A2 (64G × 8) 节点或 1 个 Atlas 800 A3 (64G
msgid ""
"`Qwen2.5-VL-3B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"
msgstr "`Qwen2.5-VL-3B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"
msgstr ""
"`Qwen2.5-VL-3B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:24
msgid ""
"`Qwen2.5-VL-7B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"
msgstr "`Qwen2.5-VL-7B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"
msgstr ""
"`Qwen2.5-VL-7B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:25
msgid ""
"`Qwen2.5-VL-32B-Instruct`:[Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"
msgstr "`Qwen2.5-VL-32B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"
msgstr ""
"`Qwen2.5-VL-32B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:26
msgid ""
"`Qwen2.5-VL-72B-Instruct`:[Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"
msgstr "`Qwen2.5-VL-72B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"
msgstr ""
"`Qwen2.5-VL-72B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:27
msgid ""
"`Qwen3-VL-2B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"
msgstr "`Qwen3-VL-2B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"
msgstr ""
"`Qwen3-VL-2B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:28
msgid ""
"`Qwen3-VL-4B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"
msgstr "`Qwen3-VL-4B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"
msgstr ""
"`Qwen3-VL-4B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:29
msgid ""
"`Qwen3-VL-8B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"
msgstr "`Qwen3-VL-8B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"
msgstr ""
"`Qwen3-VL-8B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:30
msgid ""
"`Qwen3-VL-32B-Instruct`: [Download model "
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"
msgstr "`Qwen3-VL-32B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"
msgstr ""
"`Qwen3-VL-32B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:32
msgid ""
"A sample Qwen2.5-VL quantization script can be found in the modelslim "
"code repository. [Qwen2.5-VL Quantization Script "
"Example](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"
msgstr "可以在 modelslim 代码仓库中找到 Qwen2.5-VL 的量化脚本示例。[Qwen2.5-VL 量化脚本示例](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"
msgstr ""
"可以在 modelslim 代码仓库中找到 Qwen2.5-VL 的量化脚本示例。[Qwen2.5-VL 量化脚本示例](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:34
msgid ""
@@ -172,8 +182,7 @@ msgid ""
"memory. You can find more details "
"[<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)."
msgstr ""
"`max_split_size_mb` 可防止原生分配器拆分大于此大小(以 MB 为单位)的内存块。这可以减少内存碎片,并可能使一些临界工作负载在内存耗尽前完成。您可以在"
"[<u>此处</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)找到更多详细信息。"
"`max_split_size_mb` 可防止原生分配器拆分大于此大小(以 MB 为单位)的内存块。这可以减少内存碎片,并可能使一些临界工作负载在内存耗尽前完成。您可以在[<u>此处</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)找到更多详细信息。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:115
msgid "Deployment"
@@ -217,10 +226,10 @@ msgid ""
"Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-"
"Instruct model's max seq len (256000) is larger than the maximum number "
"of tokens that can be stored in KV cache. This will differ with different"
" NPU series based on the HBM size. Please modify the value according to a"
" suitable value for your NPU series."
" NPU series based on the on-chip memory size. Please modify the value "
"according to a suitable value for your NPU series."
msgstr ""
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen3-VL-8B-Instruct 模型的最大序列长度(256000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的 HBM 大小而异。请根据您 NPU 系列的合适值修改此值。"
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen3-VL-8B-Instruct 模型的最大序列长度(256000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的片上内存大小而异。请根据您 NPU 系列的合适值修改此值。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:335
#: ../../source/tutorials/models/Qwen-VL-Dense.md:422
@@ -253,10 +262,10 @@ msgid ""
"Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-"
"Instruct model's max_model_len (128000) is larger than the maximum number"
" of tokens that can be stored in KV cache. This will differ with "
"different NPU series base on the HBM size. Please modify the value "
"according to a suitable value for your NPU series."
"different NPU series base on the on-chip memory size. Please modify the "
"value according to a suitable value for your NPU series."
msgstr ""
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen2.5-VL-32B-Instruct 模型的最大模型长度(128000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的 HBM 大小而异。请根据您 NPU 系列的合适值修改此值。"
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen2.5-VL-32B-Instruct 模型的最大模型长度(128000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的片上内存大小而异。请根据您 NPU 系列的合适值修改此值。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:468
msgid "Accuracy Evaluation"
@@ -292,7 +301,8 @@ msgid ""
"Refer to [Using "
"lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more "
"details on `lm_eval` installation."
msgstr "有关 `lm_eval` 安装的更多详细信息,请参考[使用 lm_eval](../../developer_guide/evaluation/using_lm_eval.md)。"
msgstr ""
"有关 `lm_eval` 安装的更多详细信息,请参考[使用 lm_eval](../../developer_guide/evaluation/using_lm_eval.md)。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:492
#: ../../source/tutorials/models/Qwen-VL-Dense.md:523
@@ -315,7 +325,8 @@ msgstr "以 `mmmu_val` 数据集作为测试数据集为例,在离线模式下
msgid ""
"After execution, you can get the result, here is the result of `Qwen2.5"
"-VL-32B-Instruct` in `vllm-ascend:0.11.0rc3` for reference only."
msgstr "执行后,您将获得结果。以下是 `vllm-ascend:0.11.0rc3` 中 `Qwen2.5-VL-32B-Instruct` 的结果,仅供参考。"
msgstr ""
"执行后,您将获得结果。以下是 `vllm-ascend:0.11.0rc3` 中 `Qwen2.5-VL-32B-Instruct` 的结果,仅供参考。"

#: ../../source/tutorials/models/Qwen-VL-Dense.md:543
msgid "Performance"
@@ -357,4 +368,4 @@ msgstr "性能评估必须在在线模式下进行。以 `serve` 为例。按如
msgid ""
"After about several minutes, you can get the performance evaluation "
"result."
msgstr "大约几分钟后,您将获得性能评估结果。"
msgstr "大约几分钟后,您将获得性能评估结果。"

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -35,7 +35,8 @@ msgid ""
"advancements in reasoning, instruction-following, agent capabilities, and"
" multilingual support."
msgstr ""
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合模型。基于广泛的训练,Qwen3 在推理、指令遵循、智能体能力和多语言支持方面实现了突破性进展。"
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合(MoE)模型。基于广泛的训练,Qwen3 "
"在推理、指令遵循、智能体能力和多语言支持方面实现了突破性进展。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:7
msgid ""
@@ -80,7 +81,9 @@ msgid ""
"1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G × 8)nodes. [Download "
"model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"
msgstr ""
"`Qwen3-235B-A22B`(BF16 版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas 800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) 节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"
"`Qwen3-235B-A22B`(BF16 版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas "
"800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) "
"节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:22
msgid ""
@@ -89,7 +92,10 @@ msgid ""
"8)nodes. [Download model weight](https://modelscope.cn/models/vllm-"
"ascend/Qwen3-235B-A22B-W8A8)"
msgstr ""
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas 800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) 节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)"
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas "
"800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) "
"节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-"
"W8A8)"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:24
msgid ""
@@ -106,7 +112,9 @@ msgid ""
"If you want to deploy multi-node environment, you need to verify multi-"
"node communication according to [verify multi-node communication "
"environment](../../installation.md#verify-multi-node-communication)."
msgstr "如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
msgstr ""
"如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-"
"communication)来验证多节点通信。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:30
msgid "Installation"
@@ -121,14 +129,18 @@ msgid ""
"For example, using images `quay.io/ascend/vllm-ascend:v0.11.0rc2`(for "
"Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(for Atlas "
"800 A3)."
msgstr "例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.11.0rc2`(适用于 Atlas 800 A2)和 `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(适用于 Atlas 800 A3)。"
msgstr ""
"例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.11.0rc2`(适用于 Atlas 800 A2)和 "
"`quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(适用于 Atlas 800 A3)。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:38
msgid ""
"Select an image based on your machine type and start the docker image on "
"your node, refer to [using docker](../../installation.md#set-up-using-"
"docker)."
msgstr "根据您的机器类型选择镜像并在节点上启动 Docker 容器,请参考[使用 Docker](../../installation.md#set-up-using-docker)。"
msgstr ""
"根据您的机器类型选择镜像并在节点上启动 Docker 容器,请参考[使用 Docker](../../installation.md#set-"
"up-using-docker)。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md
msgid "Build from source"
@@ -142,7 +154,9 @@ msgstr "您可以从源码构建所有组件。"
msgid ""
"Install `vllm-ascend`, refer to [set up using "
"python](../../installation.md#set-up-using-python)."
msgstr "安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-python)。"
msgstr ""
"安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-"
"python)。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:84
msgid ""
@@ -163,7 +177,10 @@ msgid ""
"`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 "
"Atlas 800 A3(64G*16), 1 Atlas 800 A2(64G*8). Quantized version need to "
"start with parameter `--quantization ascend`."
msgstr "`Qwen3-235B-A22B` 和 `Qwen3-235B-A22B-w8a8` 都可以部署在 1 个 Atlas 800 A3(64G*16) 或 1 个 Atlas 800 A2(64G*8) 上。量化版本需要使用参数 `--quantization ascend` 启动。"
msgstr ""
"`Qwen3-235B-A22B` 和 `Qwen3-235B-A22B-w8a8` 都可以部署在 1 个 Atlas 800 "
"A3(64G*16) 或 1 个 Atlas 800 A2(64G*8) 上。量化版本需要使用参数 `--quantization ascend`"
" 启动。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:93
msgid "Run the following script to execute online 128k inference."
@@ -181,7 +198,10 @@ msgid ""
"quantization weights to run long seqs (such as 128k context), it is "
"required to use yarn rope-scaling technique."
msgstr ""
"[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) 原本仅支持 40960 上下文长度(max_position_embeddings)。如果您想使用它及其相关的量化权重来运行长序列(例如 128k 上下文),需要使用 yarn rope-scaling 技术。"
"[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-"
"long-texts) 原本仅支持 40960 "
"上下文长度(max_position_embeddings)。如果您想使用它及其相关的量化权重来运行长序列(例如 128k 上下文),需要使用 "
"yarn rope-scaling 技术。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:129
#, python-brace-format
@@ -192,7 +212,8 @@ msgid ""
" \\`."
msgstr ""
"对于 `v0.12.0` 及以上版本的 vLLM,使用参数:`--hf-overrides '{\"rope_parameters\": "
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}' \\`。"
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
" \\`。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:130
#, python-brace-format
@@ -205,7 +226,10 @@ msgid ""
"parameter."
msgstr ""
"对于 `v0.12.0` 以下版本的 vLLM,使用参数:`--rope_scaling "
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}' \\`。如果您使用的是像 [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) 这样原本就支持长上下文的权重,则无需添加此参数。"
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
" \\`。如果您使用的是像 [Qwen3-235B-A22B-"
"Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)"
" 这样原本就支持长上下文的权重,则无需添加此参数。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:133
msgid "The parameters are explained as follows:"
@@ -215,7 +239,9 @@ msgstr "参数解释如下:"
msgid ""
"`--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common "
"settings for data parallelism (DP) and tensor parallelism (TP) sizes."
msgstr "`--data-parallel-size` 1 和 `--tensor-parallel-size` 8 是数据并行(DP)和张量并行(TP)大小的常见设置。"
msgstr ""
"`--data-parallel-size` 1 和 `--tensor-parallel-size` 8 "
"是数据并行(DP)和张量并行(TP)大小的常见设置。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:136
msgid ""
@@ -233,21 +259,28 @@ msgid ""
"testing performance, it is generally recommended that `--max-num-seqs` * "
"`--data-parallel-size` >= the actual total concurrency."
msgstr ""
"`--max-num-seqs` 表示每个 DP 组允许处理的最大请求数。如果发送到服务的请求数超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= 实际总并发数。"
"`--max-num-seqs` 表示每个 DP "
"组允许处理的最大请求数。如果发送到服务的请求数超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT"
" 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= "
"实际总并发数。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:138
msgid ""
"`--max-num-batched-tokens` represents the maximum number of tokens that "
"the model can process in a single step. Currently, vLLM v1 scheduling "
"enables ChunkPrefill/SplitFuse by default, which means:"
msgstr "`--max-num-batched-tokens` 表示模型在单步中可以处理的最大 token 数。目前,vLLM v1 调度默认启用 ChunkPrefill/SplitFuse,这意味着:"
msgstr ""
"`--max-num-batched-tokens` 表示模型在单步中可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
"ChunkPrefill/SplitFuse,这意味着:"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:139
msgid ""
"(1) If the input length of a request is greater than `--max-num-batched-"
"tokens`, it will be divided into multiple rounds of computation according"
" to `--max-num-batched-tokens`;"
msgstr "(1) 如果一个请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens` 被分成多轮计算;"
msgstr ""
"(1) 如果一个请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-"
"tokens` 被分成多轮计算;"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:140
msgid ""
@@ -277,14 +310,21 @@ msgid ""
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
"during actual inference. The default value is `0.9`."
msgstr ""
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache 大小。在预热阶段(在 vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可以使用的 kv_cache 就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-utilization` 设置得过高可能会导致实际推理期间出现 OOM(内存不足)问题。默认值为 `0.9`。"
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
"大小。在预热阶段(在 vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens`"
" 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可以使用的 kv_cache "
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
"utilization` 设置得过高可能会导致实际推理期间出现 OOM(内存不足)问题。默认值为 `0.9`。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:143
msgid ""
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
"does not support a mixed approach of ETP and EP; that is, MoE can either "
"use pure EP or pure TP."
msgstr "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE 可以使用纯 EP 或纯 TP。"
msgstr ""
"`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
"可以使用纯 EP 或纯 TP。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:144
msgid ""
@@ -308,7 +348,10 @@ msgid ""
"mainly used to reduce the cost of operator dispatch. Currently, "
"\"FULL_DECODE_ONLY\" is recommended."
msgstr ""
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和 \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 \"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 \"FULL_DECODE_ONLY\"。"
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 "
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
"\"FULL_DECODE_ONLY\"。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:148
msgid ""
@@ -319,14 +362,18 @@ msgid ""
"Currently, the default setting is recommended. Only in some scenarios is "
"it necessary to set this separately to achieve optimal performance."
msgstr ""
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
"40,..., `--max-num-"
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:149
msgid ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
"optimization is enabled. Currently, this optimization is only supported "
"for MoE in scenarios where tp_size > 1."
msgstr "`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 tp_size > 1 的场景下对 MoE 支持。"
msgstr ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 "
"tp_size > 1 的场景下对 MoE 支持。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:151
msgid "Multi-node Deployment with MP (Recommended)"
@@ -336,7 +383,9 @@ msgstr "使用 MP 进行多节点部署(推荐)"
msgid ""
"Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to "
"deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes."
msgstr "假设您有 Atlas 800 A3 (64G*16) 节点(或 2* A2),并希望跨多个节点部署 `Qwen3-VL-235B-A22B-Instruct` 模型。"
msgstr ""
"假设您有 Atlas 800 A3 (64G*16) 节点(或 2* A2),并希望跨多个节点部署 `Qwen3-VL-235B-A22B-"
"Instruct` 模型。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:155
msgid "Node 0"
@@ -368,7 +417,9 @@ msgstr "预填充-解码分离"
msgid ""
"refer to [Prefill-Decode Disaggregation Mooncake Verification "
"(Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"
msgstr "请参阅 [Prefill-Decode 分离部署 Mooncake 验证 (Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"
msgstr ""
"请参阅 [Prefill-Decode 分离部署 Mooncake 验证 "
"(Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:262
msgid "Functional Verification"
@@ -453,7 +504,10 @@ msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr "详情请参阅 [使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
msgstr ""
"详情请参阅 [使用 AISBench "
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation)。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:297
msgid "Using vLLM Benchmark"
@@ -542,13 +596,13 @@ msgstr "单节点 A3 (64G*16)"
msgid "Example server scripts:"
msgstr "服务器脚本示例:"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:368
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:597
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:367
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:595
msgid "Benchmark scripts:"
msgstr "基准测试脚本:"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:384
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:613
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:383
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:611
msgid "Reference test results:"
msgstr "参考测试结果:"

@@ -592,48 +646,53 @@ msgstr "48.69"
msgid "2761.72"
msgstr "2761.72"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:390
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:619
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:389
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:617
msgid "Note:"
msgstr "注意:"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:392
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:391
msgid ""
"Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused "
"operators that reduce time consumption of MoE in both prefill and decode."
" This is an experimental feature which only supports W8A8 quantization on"
" Atlas A3 servers now. If you encounter any problems when using this "
"feature, you can disable it by setting `export "
"VLLM_ASCEND_ENABLE_FUSED_MC2=0` and update issues in vLLM-Ascend "
"community."
msgstr "设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` 可启用 MoE 融合算子,以减少预填充和解码阶段 MoE 的时间消耗。这是一个实验性功能,目前仅支持 Atlas A3 服务器上的 W8A8 量化。如果您在使用此功能时遇到任何问题,可以通过设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` 来禁用它,并在 vLLM-Ascend 社区更新问题。"
"operators that reduce time consumption of MoE in decode. This is an "
"experimental feature which only supports W8A8 quantization on Atlas A3 "
"servers now. If you encounter any problems when using this feature, you "
"can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and "
"update issues in vLLM-Ascend community. **Note** that this environment "
"variable can only be enabled on decode nodes."
msgstr ""
"设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` 可启用 MoE 融合算子,以减少解码阶段 MoE "
"的时间消耗。这是一个实验性功能,目前仅支持 Atlas A3 服务器上的 W8A8 量化。如果您在使用此功能时遇到任何问题,可以通过设置 "
"`export VLLM_ASCEND_ENABLE_FUSED_MC2=0` 来禁用它,并在 vLLM-Ascend 社区更新问题。**注意**,此环境变量只能在解码节点上启用。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:393
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:392
msgid ""
"Here we disable prefix cache because of random datasets. You can enable "
"prefix cache if requests have long common prefix."
msgstr "由于使用随机数据集,此处我们禁用了前缀缓存。如果请求具有较长的公共前缀,您可以启用前缀缓存。"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:395
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:394
msgid "Three Node A3 -- PD disaggregation"
msgstr "三节点 A3 -- PD 分离部署"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:397
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:396
msgid ""
"On three Atlas 800 A3(64G*16) server, we recommend to use one node as one"
" prefill instance and two nodes as one decode instance. Example server "
"scripts: Prefill Node 1"
msgstr "在三台 Atlas 800 A3(64G*16) 服务器上,我们建议使用一个节点作为一个预填充实例,两个节点作为一个解码实例。服务器脚本示例:预填充节点 1"
msgstr ""
"在三台 Atlas 800 A3(64G*16) "
"服务器上,我们建议使用一个节点作为一个预填充实例,两个节点作为一个解码实例。服务器脚本示例:预填充节点 1"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:462
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:460
msgid "Decode Node 1"
msgstr "解码节点 1"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:526
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:524
msgid "Decode Node 2"
msgstr "解码节点 2"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:591
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:589
msgid "PD proxy:"
msgstr "PD 代理:"

@@ -657,9 +716,13 @@ msgstr "52.07"
msgid "8593.44"
msgstr "8593.44"

#: ../../source/tutorials/models/Qwen3-235B-A22B.md:621
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:619
msgid ""
"We recommend to set `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` on this "
"scenario (typically EP32 for Qwen3-235B). This enables a different MoE "
"fusion operator."
msgstr "在此场景下(通常 Qwen3-235B 使用 EP32),我们建议设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=2`。这将启用一个不同的 MoE 融合算子。"
"fusion operator. **Note** that this environment variable can only be "
"enabled on decode nodes."
msgstr ""
"在此场景下(通常 Qwen3-235B 使用 EP32),我们建议设置 `export "
"VLLM_ASCEND_ENABLE_FUSED_MC2=2`。这将启用一个不同的 MoE 融合算子。"
"**注意**:此环境变量只能在解码节点上启用。"

@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,17 +29,15 @@ msgstr "简介"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:5
|
||||
msgid ""
|
||||
"Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation "
|
||||
"models. It processes text, images, audio, and video, and delivers real-"
|
||||
"Qwen3-Omni is a native end-to-end multilingual omni-modal foundation "
|
||||
"model. It processes text, images, audio, and video, and delivers real-"
|
||||
"time streaming responses in both text and natural speech. We introduce "
|
||||
"several architectural upgrades to improve performance and efficiency. The"
|
||||
" Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, "
|
||||
"equipped with chain-of-thought reasoning, supporting audio, video, and "
|
||||
"text input, with text output."
|
||||
" Thinking model of Qwen3-Omni-30B-A3B, which contains the thinker "
|
||||
"component, is equipped with chain-of-thought reasoning and supports "
|
||||
"audio, video, and text input, with text output."
|
||||
msgstr ""
|
||||
"Qwen3-Omni "
|
||||
"是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3"
|
||||
"-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
||||
"Qwen3-Omni 是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:7
|
||||
msgid ""
|
||||
@@ -54,21 +52,19 @@ msgstr "支持的功能"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:11
|
||||
msgid ""
|
||||
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
||||
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
||||
"model's supported feature matrix."
|
||||
msgstr ""
|
||||
"请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
||||
"请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:13
|
||||
msgid ""
|
||||
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
||||
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
||||
"configuration."
|
||||
msgstr ""
|
||||
"请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
||||
"请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:15
|
||||
msgid "Environment Preparation"
|
||||
@@ -83,17 +79,15 @@ msgid ""
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards (64G × 2).[Download "
|
||||
"model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||
"Thinking) It is recommended to download the model weight to the shared "
|
||||
"directory of multiple nodes, such as `/root/.cache/`"
|
||||
"directory of multiple nodes, such as `/root/.cache/`"
|
||||
msgstr ""
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × "
|
||||
"2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||
"Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × 2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
||||
|
||||
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:22
 msgid "Installation"
 msgstr "安装"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:24
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
 msgid "Use docker image"
 msgstr "使用 Docker 镜像"

@@ -109,10 +103,9 @@ msgid ""
 "your node, refer to [using docker](../../installation.md#set-up-using-"
 "docker)."
 msgstr ""
-"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-"
-"up-using-docker)。"
+"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-up-using-docker)。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:32
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
 msgid "Build from source"
 msgstr "从源码构建"

@@ -125,8 +118,7 @@ msgid ""
 "Install `vllm-ascend`, refer to [set up using "
 "python](../../installation.md#set-up-using-python)."
 msgstr ""
-"安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-"
-"python)。"
+"安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-python)。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:71
 msgid "Please install system dependencies"
@@ -159,8 +151,7 @@ msgid ""
 " least 1, and for 32 GB of memory, tensor-parallel-size should be at "
 "least 2."
 msgstr ""
-"运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-"
-"size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"
+"运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"

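The sizing rule in the entry above boils down to a launch line like the following sketch (the model path and `--max-model-len` value are illustrative assumptions, not taken from the tutorial's actual script):

```bash
# Sketch: serve Qwen3-Omni-30B-A3B-Thinking on 2 x 64 GB NPU cards.
# --tensor-parallel-size 2 follows the "at least 1 per 64 GB card, at least
# 2 per 32 GB card" rule quoted above; --max-model-len is an assumption.
vllm serve /root/.cache/Qwen3-Omni-30B-A3B-Thinking \
  --tensor-parallel-size 2 \
  --max-model-len 32768
```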
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:188
 msgid "Functional Verification"
@@ -188,8 +179,7 @@ msgid ""
 "dataset, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in "
 "online mode."
 msgstr ""
-"以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-"
-"Thinking` 的精度评估。"
+"以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-Thinking` 的精度评估。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:239
 msgid ""
@@ -197,21 +187,19 @@
 "evalscope(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
 "#install-evalscope-using-pip>) for `evalscope`installation."
 msgstr ""
-"关于 `evalscope` 的安装,请参考使用 evalscope "
-"(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
-"#install-evalscope-using-pip>)。"
+"关于 `evalscope` 的安装,请参考使用 evalscope (<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip>)。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:240
 msgid "Run `evalscope` to execute the accuracy evaluation."
 msgstr "运行 `evalscope` 以执行精度评估。"

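The two entries above describe installing `evalscope` via pip and running it against the online server. A hedged sketch of that flow follows; the `evalscope eval` flag names are assumptions based on typical usage, so verify them against the linked guide:

```bash
# Sketch only: flag names are assumptions; check `evalscope eval --help`
# and the linked installation guide before running.
pip install evalscope
evalscope eval \
  --model Qwen3-Omni-30B-A3B-Thinking \
  --api-url http://localhost:8000/v1 \
  --eval-type service \
  --datasets gsm8k
```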
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:255
 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen3"
 "-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
 msgstr ""
-"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
-"中的结果,仅供参考。"
+"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 中的结果,仅供参考。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:269
 msgid "Performance"
@@ -228,8 +216,7 @@ msgid ""
 "benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more "
 "details."
 msgstr ""
-"以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 vllm 基准测试。更多详情请参考 [vllm"
-" 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
+"以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:277
 msgid "There are three `vllm bench` subcommands:"
@@ -249,12 +236,4 @@ msgstr "`throughput`:对离线推理吞吐量进行基准测试。"

 #: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:283
 msgid "Take the `serve` as an example. Run the code as follows."
 msgstr "以 `serve` 为例。按如下方式运行代码。"
-
-#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
-msgid ""
-"After execution, you can get the result, here is the result of `Qwen3"
-"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
-msgstr ""
-"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
-"中的结果,仅供参考。"
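For the `vllm bench serve` example the Qwen3-Omni entries point at, a minimal sketch (dataset choice, token lengths, and request count are illustrative assumptions):

```bash
# Sketch: benchmark online serving throughput against the running server.
# Dataset name, lengths, and prompt count below are assumptions.
vllm bench serve \
  --model /root/.cache/Qwen3-Omni-30B-A3B-Thinking \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 200
```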
--- a/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -118,7 +118,7 @@ msgstr ""
 msgid "Installation"
 msgstr "安装"

 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:34
 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
 msgid "Use docker image"
 msgstr "使用 Docker 镜像"

@@ -140,7 +140,7 @@ msgstr ""
 "根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-"
 "up-using-docker)。"

 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:76
 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
 msgid "Build from source"
 msgstr "从源码构建"

@@ -185,15 +185,15 @@ msgid ""
 "A3(64G*16)."
 msgstr "在 1 个 Atlas 800 A3(64G*16) 上运行以下脚本以执行在线 128k 推理。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:133
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:132
 msgid "**Notice:**"
 msgstr "**注意:**"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:135
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:134
 msgid "The parameters are explained as follows:"
 msgstr "参数解释如下:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:137
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:136
 msgid ""
 "`--data-parallel-size` 1 and `--tensor-parallel-size` 16 are common "
 "settings for data parallelism (DP) and tensor parallelism (TP) sizes."
@@ -201,13 +201,13 @@ msgstr ""
 "`--data-parallel-size` 1 和 `--tensor-parallel-size` 16 是数据并行 (DP) 和张量并行 "
 "(TP) 大小的常见设置。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:137
 msgid ""
 "`--max-model-len` represents the context length, which is the maximum "
 "value of the input plus output for a single request."
 msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
 msgid ""
 "`--max-num-seqs` indicates the maximum number of requests that each DP "
 "group is allowed to process. If the number of requests sent to the "
@@ -222,7 +222,7 @@ msgstr ""
 " 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= "
 "实际总并发数。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
 msgid ""
 "`--max-num-batched-tokens` represents the maximum number of tokens that "
 "the model can process in a single step. Currently, vLLM v1 scheduling "
@@ -231,7 +231,7 @@ msgstr ""
 "`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
 "ChunkPrefill/SplitFuse,这意味着:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
 msgid ""
 "(1) If the input length of a request is greater than `--max-num-batched-"
 "tokens`, it will be divided into multiple rounds of computation according"
@@ -240,20 +240,20 @@ msgstr ""
 "(1) 如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-"
 "tokens` 被分成多轮计算;"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
 msgid ""
 "(2) Decode requests are prioritized for scheduling, and prefill requests "
 "are scheduled only if there is available capacity."
 msgstr "(2) 解码请求优先调度,只有在有可用容量时才调度预填充请求。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
 msgid ""
 "Generally, if `--max-num-batched-tokens` is set to a larger value, the "
 "overall latency will be lower, but the pressure on GPU memory (activation"
 " value usage) will be greater."
 msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
 msgid ""
 "`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
 "will use for actual inference. Its essential function is to calculate the"
@@ -275,7 +275,7 @@ msgstr ""
 "就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
 "utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
 msgid ""
 "`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
 "does not support a mixed approach of ETP and EP; that is, MoE can either "
@@ -284,7 +284,7 @@ msgstr ""
 "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
 "要么使用纯 EP,要么使用纯 TP。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
 msgid ""
 "`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
 "To enable it, for mamba-like models Qwen3.5, set `--enable-prefix-"
@@ -298,13 +298,13 @@ msgstr ""
 "的实现可能在调度时导致非常大的 block_size。例如,block_size 可能被调整为 2048,这意味着任何短于 2048 "
 "的前缀将永远不会被缓存。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
 msgid ""
 "`--quantization` \"ascend\" indicates that quantization is used. To "
 "disable quantization, remove this option."
 msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:148
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
 msgid ""
 "`--compilation-config` contains configurations related to the aclgraph "
 "graph mode. The most significant configurations are \"cudagraph_mode\" "
@@ -319,7 +319,7 @@ msgstr ""
 "\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
 "\"FULL_DECODE_ONLY\"。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:150
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:149
 msgid ""
 "\"cudagraph_capture_sizes\": represents different levels of graph modes. "
 "The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
@@ -332,123 +332,132 @@ msgstr ""
 "40,..., `--max-num-"
 "seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"

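The parameter entries above read as a checklist for the tutorial's launch script. A sketch assembling exactly the flags they explain (the model path and all numeric values are illustrative assumptions, not the tutorial's actual settings):

```bash
# Sketch: single-node launch using the flags explained in the entries above;
# the path and numbers are assumptions, tune them per that guidance.
vllm serve /root/.cache/Qwen3.5-397B-A17B-w8a8 \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --quantization ascend \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```

Note the rule from the `--max-num-seqs` entry: with these values, `--max-num-seqs` 32 times `--data-parallel-size` 1 caps in-flight requests at 32, so keep test concurrency at or below that product when benchmarking.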
-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:152
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:151
 msgid "Multi-node Deployment with MP (Recommended)"
 msgstr "使用 MP 的多节点部署(推荐)"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:154
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:153
 msgid ""
 "Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5"
 "-397B-A17B-w8a8-mtp` model across multiple nodes."
 msgstr "假设您有 2 个 Atlas 800 A2 节点,并希望跨多个节点部署 `Qwen3.5-397B-A17B-w8a8-mtp` 模型。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:156
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:155
 msgid "Node 0"
 msgstr "节点 0"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:202
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:201
 msgid "Node1"
 msgstr "节点 1"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:252
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:251
 msgid ""
 "If the service starts successfully, the following information will be "
 "displayed on node 0:"
 msgstr "如果服务启动成功,节点 0 上将显示以下信息:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:263
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:262
 msgid "Multi-node Deployment with Ray"
 msgstr "使用 Ray 的多节点部署"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:265
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:264
 msgid "refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md)."
 msgstr "请参考 [Ray 分布式 (Qwen/Qwen3-235B-A22B)](../features/ray.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:267
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:266
 msgid "Prefill-Decode Disaggregation"
 msgstr "预填充-解码解耦"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:269
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:268
 msgid ""
 "We recommend using Mooncake for deployment: "
 "[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
-msgstr "我们推荐使用 Mooncake 进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
+msgstr ""
+"我们推荐使用 Mooncake "
+"进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:271
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:270
 msgid ""
 "Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 1P1D (3 "
 "nodes) to run Qwen3.5-397B-A17B."
 msgstr "以 Atlas 800 A3 (64G × 16) 为例,我们建议部署 1P1D(3 个节点)来运行 Qwen3.5-397B-A17B。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:273
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:272
 msgid "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` require 3 Atlas 800 A3 (64G × 16)."
 msgstr "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` 需要 3 个 Atlas 800 A3 (64G × 16)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:275
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:274
 msgid ""
 "To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
 "to deploy `run_p.sh` 、`run_d0.sh` and `run_d1.sh` script on each node and"
 " deploy a `proxy.sh` script on prefill master node to forward requests."
-msgstr "要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 `run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"
+msgstr ""
+"要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 "
+"`run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:277
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:276
 msgid "Prefill Node 0 `run_p.sh` script"
 msgstr "预填充节点 0 `run_p.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:352
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:350
 msgid "Decode Node 0 `run_d0.sh` script"
 msgstr "解码节点 0 `run_d0.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:432
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:430
 msgid "Decode Node 1 `run_d1.sh` script"
 msgstr "解码节点 1 `run_d1.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:519
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:517
 msgid "Run the `proxy.sh` script on the prefill master node"
 msgstr "在预填充主节点上运行 `proxy.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:521
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:519
 msgid ""
 "Run a proxy server on the same node with the prefiller service instance. "
 "You can get the proxy program in the repository's examples: "
 "[load\_balance\_proxy\_server\_example.py](https://github.com/vllm-"
 "project/vllm-"
 "ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
-msgstr "在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
+msgstr ""
+"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\_balance\_proxy\_server\_example.py](https://github.com"
+"/vllm-project/vllm-"
+"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:547
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:545
 msgid "Functional Verification"
 msgstr "功能验证"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:549
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:547
 msgid "Once your server is started, you can query the model with input prompts:"
 msgstr "服务器启动后,您可以使用输入提示词查询模型:"

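For the "query the model with input prompts" step above, a minimal sketch against vLLM's OpenAI-compatible endpoint (host, port, and served model name are assumptions):

```bash
# Sketch: host/port and the served model name are illustrative assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.5-397B-A17B-w8a8-mtp",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
      }'
```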
-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:562
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:560
 msgid "Accuracy Evaluation"
 msgstr "精度评估"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:564
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:562
 msgid "Here are two accuracy evaluation methods."
 msgstr "以下是两种精度评估方法。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:566
-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:578
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:564
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:576
 msgid "Using AISBench"
 msgstr "使用 AISBench"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:568
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:566
 msgid ""
 "Refer to [Using "
 "AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
 "details."
 msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:570
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:568
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen3.5"
 "-397B-A17B-w8a8` in `vllm-ascend:v0.17.0rc1` for reference only."
-msgstr "执行后,您可以获得结果,以下是 `vllm-ascend:v0.17.0rc1` 中 `Qwen3.5-397B-A17B-w8a8` 的结果,仅供参考。"
+msgstr ""
+"执行后,您可以获得结果,以下是 `vllm-ascend:v0.17.0rc1` 中 `Qwen3.5-397B-A17B-w8a8` "
+"的结果,仅供参考。"

 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:76
 msgid "dataset"
@@ -490,54 +499,74 @@ msgstr "生成"
 msgid "96.74"
 msgstr "96.74"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:576
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:574
 msgid "Performance"
 msgstr "性能"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:580
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:578
 msgid ""
 "Refer to [Using AISBench for performance "
 "evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
 "performance-evaluation) for details."
-msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
+msgstr ""
+"详情请参阅[使用 AISBench "
+"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
+"performance-evaluation)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:582
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:580
 msgid "Using vLLM Benchmark"
 msgstr "使用 vLLM Benchmark"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:584
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:582
 msgid "Run performance evaluation of `Qwen3.5-397B-A17B-w8a8` as an example."
 msgstr "以运行 `Qwen3.5-397B-A17B-w8a8` 的性能评估为例。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:586
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:584
 msgid ""
 "Refer to [vllm "
 "benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
 "for more details."
-msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"
+msgstr ""
+"更多详情请参阅 [vllm "
+"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:588
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:586
 msgid "There are three `vllm bench` subcommands:"
 msgstr "`vllm bench` 有三个子命令:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:590
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:588
 msgid "`latency`: Benchmark the latency of a single batch of requests."
 msgstr "`latency`:对单批请求的延迟进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:591
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:589
 msgid "`serve`: Benchmark the online serving throughput."
 msgstr "`serve`:对在线服务吞吐量进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:590
 msgid "`throughput`: Benchmark offline inference throughput."
 msgstr "`throughput`:对离线推理吞吐量进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:594
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
 msgid "Take the `serve` as an example. Run the code as follows."
 msgstr "以 `serve` 为例。运行代码如下。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:601
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:599
 msgid ""
 "After about several minutes, you can get the performance evaluation "
 "result."
 msgstr "大约几分钟后,您将获得性能评估结果。"

+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:601
+msgid "Qwen3.5-397B-A17B Known issues"
+msgstr "Qwen3.5-397B-A17B 已知问题"
+
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:603
+msgid ""
+"Issue1: For single-node deployment scenario, when fused_mc2 is enabled, "
+"using multi-DP model deployment may cause garbled or empty outputs after "
+"the model triggers recomputation.When tuning performance by adjusting "
+"model parallelism, ensure that this fused operator is disabled when DP > "
+"1. For PD deployment scenario,D nodes can avoid this problem by enabling "
+"the recompute scheduler."
+msgstr ""
+"问题1:在单节点部署场景下,当启用 fused_mc2 时,使用多 DP 模型部署可能会导致模型触发重计算后输出乱码或为空。在通过调整模型并行度来调优性能时,请确保当 DP > 1 时禁用此融合算子。对于 PD 部署场景,D 节点可以通过启用重计算调度器来避免此问题。"

--- a/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -37,7 +37,9 @@ msgid ""
 "model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of "
 "vLLM Ascend support the model."
 msgstr ""
-"Qwen3 Embedding 模型系列是 Qwen 家族最新的专有模型,专为文本嵌入和排序任务设计。它基于 Qwen3 系列的稠密基础模型,提供了多种尺寸(0.6B、4B 和 8B)的全面文本嵌入和重排序模型。本指南描述了如何使用 vLLM Ascend 运行该模型。请注意,只有 vLLM Ascend 0.9.2rc1 及更高版本支持此模型。"
+"Qwen3 Embedding 模型系列是 Qwen 家族最新的专有模型,专为文本嵌入和排序任务设计。它基于 Qwen3 "
+"系列的稠密基础模型,提供了多种尺寸(0.6B、4B 和 8B)的全面文本嵌入和重排序模型。本指南描述了如何使用 vLLM Ascend "
+"运行该模型。请注意,只有 vLLM Ascend 0.9.2rc1 及更高版本支持此模型。"

 #: ../../source/tutorials/models/Qwen3_embedding.md:7
 msgid "Supported Features"
@@ -62,19 +64,25 @@ msgstr "模型权重"
 msgid ""
 "`Qwen3-Embedding-8B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-8B)"
-msgstr "`Qwen3-Embedding-8B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-8B)"
+msgstr ""
+"`Qwen3-Embedding-8B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3"
+"-Embedding-8B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:16
 msgid ""
 "`Qwen3-Embedding-4B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-4B)"
-msgstr "`Qwen3-Embedding-4B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-4B)"
+msgstr ""
+"`Qwen3-Embedding-4B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3"
+"-Embedding-4B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:17
 msgid ""
 "`Qwen3-Embedding-0.6B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"
-msgstr "`Qwen3-Embedding-0.6B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"
+msgstr ""
+"`Qwen3-Embedding-0.6B` "
+"[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:19
 msgid ""
@@ -96,7 +104,9 @@ msgstr "您可以使用我们的官方 docker 镜像来运行 `Qwen3-Embedding`"
 msgid ""
 "Start the docker image on your node, refer to [using "
 "docker](../../installation.md#set-up-using-docker)."
-msgstr "在您的节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-docker)。"
+msgstr ""
+"在您的节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-"
+"docker)。"

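Once the container is up and a `Qwen3-Embedding` model is served, the standard OpenAI-compatible embeddings endpoint applies; a hedged sketch (host, port, and served model name are assumptions):

```bash
# Sketch: request an embedding from the running server; host/port and
# model name are illustrative assumptions.
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-Embedding-8B",
        "input": "vLLM Ascend supports Qwen3 Embedding models."
      }'
```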
 #: ../../source/tutorials/models/Qwen3_embedding.md:27
 msgid ""
@@ -142,10 +152,12 @@ msgstr "性能"

 #: ../../source/tutorials/models/Qwen3_embedding.md:98
 msgid ""
-"Run performance of `Qwen3-Reranker-8B` as an example. Refer to [vllm "
+"Run performance of `Qwen3-Embedding-8B` as an example. Refer to [vllm "
 "benchmark](https://docs.vllm.ai/en/latest/contributing/) for more "
 "details."
-msgstr "以 `Qwen3-Reranker-8B` 的运行性能为例。更多详情请参考 [vllm 基准测试](https://docs.vllm.ai/en/latest/contributing/)。"
+msgstr ""
+"以 `Qwen3-Embedding-8B` 的运行性能为例。更多详情请参考 [vllm "
+"基准测试](https://docs.vllm.ai/en/latest/contributing/)。"

 #: ../../source/tutorials/models/Qwen3_embedding.md:101
 msgid "Take the `serve` as an example. Run the code as follows."