[v0.18.0][Doc] Translated Doc files 2026-04-22 (#8565)
## Auto-Translation Summary

Translated **43** file(s):

- <code>docs/source/locale/zh_CN/LC_MESSAGES/community/versioning_policy.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/KV_Cache_Pool_Guide.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/disaggregated_prefill.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/patch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/Design_Documents/quantization.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/index.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/contribution/multi_node_test.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_ais_bench.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_evalscope.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_lm_eval.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/developer_guide/evaluation/using_opencompass.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/faqs.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/installation.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/long_sequence_context_parallel_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_colocated_mooncake_multi_instance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_multi_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/features/pd_disaggregation_mooncake_single_node.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/DeepSeek-V3.1.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM4.x.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/GLM5.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/PaddleOCR-VL.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen-VL-Dense.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-235B-A22B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3.5-397B-A17B.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/tutorials/models/Qwen3_embedding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/configuration/additional_config.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/Fine_grained_TP.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/batch_invariance.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/context_parallel.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/cpu_binding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/dynamic_batch.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/eplb_swift_balancer.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/kv_pool.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/layer_sharding.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/netloader.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/npugraph_ex.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/sleep_mode.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/ucm_deployment.po</code>
- <code>docs/source/locale/zh_CN/LC_MESSAGES/user_guide/feature_guide/weight_prefetch.po</code>

---

[Workflow run](https://github.com/vllm-project/vllm-ascend/actions/runs/24767290887)

Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -381,7 +381,8 @@ msgid ""
 "If you're using v0.7.3, don't forget to install [mindie-"
 "turbo](https://pypi.org/project/mindie-turbo) as well."
 msgstr ""
-"如果您正在使用 v0.7.3,请别忘了同时安装 [mindie-turbo](https://pypi.org/project/mindie-turbo)。"
+"如果您正在使用 v0.7.3,请别忘了同时安装 [mindie-turbo](https://pypi.org/project/mindie-"
+"turbo)。"

 #: ../../source/community/versioning_policy.md:58
 msgid ""
@@ -1019,12 +1020,12 @@ msgstr "软件依赖管理"
 #: ../../source/community/versioning_policy.md:174
 msgid ""
 "`torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable "
-"version to [PyPI](https://pypi.org/project/torch-npu) every 3 months, a "
+"version to [PyPi](https://pypi.org/project/torch-npu) every 3 months, a "
 "development version (aka the POC version) every month, and a nightly "
-"version every day. The PyPI stable version **CAN** be used in vLLM Ascend"
+"version every day. The PyPi stable version **CAN** be used in vLLM Ascend"
 " final version, the monthly dev version **ONLY CAN** be used in vLLM "
 "Ascend RC version for rapid iteration, and the nightly version **CANNOT**"
-" be used in vLLM Ascend any version or branch."
+" be used in any vLLM Ascend version or branch."
 msgstr ""
 "`torch-npu`:Ascend Extension for PyTorch(torch-npu)每 3 个月在 "
 "[PyPI](https://pypi.org/project/torch-npu) 发布一个稳定版本,每月发布一个开发版本(亦称 POC "
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -37,17 +37,17 @@ msgstr "前缀缓存是大语言模型推理中的一项重要特性,可以显
 msgid ""
 "However, the performance gain from prefix caching is highly dependent on "
 "the cache hit rate, while the cache hit rate can be limited if one only "
-"uses HBM for KV cache storage."
-msgstr "然而,前缀缓存带来的性能提升高度依赖于缓存命中率,而如果仅使用 HBM 存储 KV 缓存,缓存命中率会受到限制。"
+"uses on-chip memory for KV cache storage."
+msgstr "然而,前缀缓存带来的性能提升高度依赖于缓存命中率,而如果仅使用片上内存存储 KV 缓存,缓存命中率会受到限制。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:9
 msgid ""
 "Hence, KV Cache Pool is proposed to utilize various types of storage "
-"including HBM, DRAM, and SSD, making a pool for KV Cache storage while "
-"making the prefix of requests visible across all nodes, increasing the "
-"cache hit rate for all requests."
+"including on-chip memory, DRAM, and SSD, making a pool for KV Cache "
+"storage while making the prefix of requests visible across all nodes, "
+"increasing the cache hit rate for all requests."
 msgstr ""
-"因此,我们提出了 KV 缓存池,旨在利用包括 HBM、DRAM 和 SSD 在内的多种存储类型,构建一个 KV "
+"因此,我们提出了 KV 缓存池,旨在利用包括片上内存、DRAM 和 SSD 在内的多种存储类型,构建一个 KV "
 "缓存存储池,同时使请求的前缀在所有节点间可见,从而提高所有请求的缓存命中率。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:11
@@ -111,9 +111,9 @@ msgstr "工作原理"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:25
 msgid ""
-"The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.)"
-" through a connector-based architecture."
-msgstr "KV 缓存池通过基于连接器的架构,整合了多个内存层级(HBM、DRAM、SSD 等)。"
+"The KV Cache Pool integrates multiple memory tiers (on-chip memory, DRAM,"
+" SSD, etc.) through a connector-based architecture."
+msgstr "KV 缓存池通过基于连接器的架构,整合了多个内存层级(片上内存、DRAM、SSD 等)。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:27
 msgid ""
@@ -124,25 +124,25 @@ msgstr "每个连接器实现了一个统一的接口,用于根据访问频率

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:29
 msgid ""
-"When combined with vLLM’s Prefix Caching mechanism, the pool enables "
-"efficient caching both locally (in HBM) and globally (via Mooncake), "
-"ensuring that frequently used prefixes remain hot while less frequently "
-"accessed KV data can spill over to lower-cost memory."
+"When combined with vLLM's Prefix Caching mechanism, the pool enables "
+"efficient caching both locally (in on-chip memory) and globally (via "
+"Mooncake), ensuring that frequently used prefixes remain hot while less "
+"frequently accessed KV data can spill over to lower-cost memory."
 msgstr ""
-"当与 vLLM 的前缀缓存机制结合时,该池能够实现本地(HBM 中)和全局(通过 "
+"当与 vLLM 的前缀缓存机制结合时,该池能够实现本地(片上内存中)和全局(通过 "
 "Mooncake)的高效缓存,确保常用前缀保持热状态,而访问频率较低的 KV 数据则可以溢出到成本更低的内存中。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:31
-msgid "1. Combining KV Cache Pool with HBM Prefix Caching"
-msgstr "1. 将 KV 缓存池与 HBM 前缀缓存结合"
+msgid "1. Combining KV Cache Pool with on-chip memory Prefix Caching"
+msgstr "1. 将 KV 缓存池与片上内存前缀缓存结合"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:33
 msgid ""
-"Prefix Caching with HBM is already supported by the vLLM V1 Engine. By "
-"introducing KV Connector V1, users can seamlessly combine HBM-based "
-"Prefix Caching with Mooncake-backed KV Pool."
+"Prefix Caching with on-chip memory is already supported by the vLLM V1 "
+"Engine. By introducing KV Connector V1, users can seamlessly combine on-"
+"chip memory-based Prefix Caching with Mooncake-backed KV Pool."
 msgstr ""
-"vLLM V1 引擎已支持基于 HBM 的前缀缓存。通过引入 KV Connector V1,用户可以无缝地将基于 HBM 的前缀缓存与 "
+"vLLM V1 引擎已支持基于片上内存的前缀缓存。通过引入 KV Connector V1,用户可以无缝地将基于片上内存的前缀缓存与 "
 "Mooncake 支持的 KV 池结合起来。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:36
@@ -160,24 +160,25 @@ msgid "**Workflow**:"
 msgstr "**工作流程**:"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:40
-msgid "The engine first checks for prefix hits in the HBM cache."
-msgstr "引擎首先检查 HBM 缓存中的前缀命中情况。"
+msgid "The engine first checks for prefix hits in the on-chip memory cache."
+msgstr "引擎首先检查片上内存缓存中的前缀命中情况。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:42
 msgid ""
-"After getting the number of hit tokens on HBM, it queries the KV Pool via"
-" the connector. If there are additional hits in the KV Pool, we get the "
-"**additional blocks only** from the KV Pool, and get the rest of the "
-"blocks directly from HBM to minimize the data transfer latency."
+"After getting the number of hit tokens on on-chip memory, it queries the "
+"KV Pool via the connector. If there are additional hits in the KV Pool, "
+"we get the **additional blocks only** from the KV Pool, and get the rest "
+"of the blocks directly from on-chip memory to minimize the data transfer "
+"latency."
 msgstr ""
-"获取 HBM 上的命中令牌数量后,引擎通过连接器查询 KV 池。如果在 KV 池中有额外的命中,我们**仅从 KV "
-"池获取额外的块**,其余块则直接从 HBM 获取,以最小化数据传输延迟。"
+"获取片上内存上的命中令牌数量后,引擎通过连接器查询 KV 池。如果在 KV 池中有额外的命中,我们**仅从 KV "
+"池获取额外的块**,其余块则直接从片上内存获取,以最小化数据传输延迟。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:44
 msgid ""
-"After the KV Caches in the KV Pool are loaded into HBM, the remaining "
-"process is the same as Prefix Caching in HBM."
-msgstr "将 KV 池中的 KV 缓存加载到 HBM 后,剩余过程与 HBM 中的前缀缓存相同。"
+"After the KV Caches in the KV Pool are loaded into on-chip memory, the "
+"remaining process is the same as Prefix Caching in on-chip memory."
+msgstr "将 KV 池中的 KV 缓存加载到片上内存后,剩余过程与片上内存中的前缀缓存相同。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:46
 msgid "2. Combining KV Cache Pool with Mooncake PD Disaggregation"
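The workflow strings in the hunk above describe a two-tier lookup: count prefix hits in on-chip memory first, then ask the KV Pool only for the blocks beyond that boundary. The following is a minimal sketch of that ordering, not the vLLM or Mooncake implementation; `TwoTierPrefixLookup`, its fields, and the block-hash strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwoTierPrefixLookup:
    """Hypothetical helper: count prefix hits locally, then in the KV pool."""

    local_blocks: set = field(default_factory=set)  # block hashes already in on-chip memory
    pool_blocks: set = field(default_factory=set)   # block hashes stored in the KV pool

    @staticmethod
    def _leading_hits(block_hashes, cache):
        """Number of consecutive blocks, from the start, that are present in cache."""
        hits = 0
        for block_hash in block_hashes:
            if block_hash not in cache:
                break
            hits += 1
        return hits

    def plan(self, block_hashes):
        """Return (blocks reused from on-chip memory, extra blocks to load from the pool)."""
        local_hits = self._leading_hits(block_hashes, self.local_blocks)
        # Only the blocks past the local hit boundary are requested from the pool.
        pool_hits = self._leading_hits(block_hashes[local_hits:], self.pool_blocks)
        return local_hits, pool_hits


lookup = TwoTierPrefixLookup(local_blocks={"b0", "b1"}, pool_blocks={"b0", "b1", "b2", "b3"})
print(lookup.plan(["b0", "b1", "b2", "b3", "b4"]))  # (2, 2)
```

In this example the first two blocks are served from on-chip memory, the next two are the "additional blocks only" fetched from the pool, and the last block has to be computed.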
@@ -202,12 +203,12 @@ msgstr ""
 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:52
 msgid ""
 "The key benefit of doing this is that we can keep the gain in performance"
-" by computing less with Prefix Caching from HBM and KV Pool for Prefill "
-"Nodes, while not sacrificing the data transfer efficiency between Prefill"
-" and Decode nodes with P2P KV Connector that transfers KV Caches between "
-"NPU devices directly."
+" by computing less with Prefix Caching from on-chip memory and KV Pool "
+"for Prefill Nodes, while not sacrificing the data transfer efficiency "
+"between Prefill and Decode nodes with P2P KV Connector that transfers KV "
+"Caches between NPU devices directly."
 msgstr ""
-"这样做的主要好处是,我们可以通过为预填充节点使用来自 HBM 和 KV "
+"这样做的主要好处是,我们可以通过为预填充节点使用来自片上内存和 KV "
 "池的前缀缓存来减少计算量,从而保持性能增益,同时又不牺牲预填充节点与解码节点之间的数据传输效率,因为 P2P KV 连接器直接在 NPU "
 "设备间传输 KV 缓存。"

@@ -332,7 +333,8 @@ msgstr "限制"
 msgid ""
 "Currently, MooncakeStore for vLLM-Ascend only supports DRAM as the "
 "storage for KV Cache pool."
-msgstr "目前,vLLM-Ascend 的 MooncakeStore 仅支持 DRAM 作为 KV 缓存池的存储。"
+msgstr ""
+"目前,vLLM-Ascend 的 MooncakeStore 仅支持 DRAM 作为 KV 缓存池的存储介质。"

 #: ../../source/developer_guide/Design_Documents/KV_Cache_Pool_Guide.md:91
 msgid ""
@@ -344,5 +346,5 @@ msgid ""
 "there's no prefix cache hit (or even better, revert only one block and "
 "keep using the Prefix Caches before that)."
 msgstr ""
-"目前,如果我们成功查找到一个键并发现其存在,但在调用 KV 池的 get 函数时获取失败,我们仅输出一条日志表明 get "
-"操作失败并继续执行;因此,该特定请求的准确性可能会受到影响。我们将通过回退请求并假设没有前缀缓存命中来重新计算所有内容(或者更优的方案是,仅回退一个块并继续使用该块之前的前缀缓存)来处理这种情况。"
+"目前,如果我们成功查找到一个键并确认其存在,但在调用 KV 池的 get 函数时获取失败,我们仅输出一条日志表明 get "
+"操作失败并继续执行;因此,该特定请求的准确性可能会受到影响。我们将通过回退该请求并假设没有前缀缓存命中来重新计算所有内容(或者更优的方案是,仅回退一个块并继续使用该块之前的前缀缓存)来处理这种情况。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -193,8 +193,10 @@ msgid "**Build CPU pools**:"
 msgstr "**构建 CPU 池**:"

 #: ../../source/developer_guide/Design_Documents/cpu_binding.md:39
-msgid "Use **global_slice** for A3 devices; **topo_affinity** for A2 and 310P."
-msgstr "对 A3 设备使用 **global_slice**;对 A2 和 310P 使用 **topo_affinity**。"
+msgid ""
+"Use **global_slice** for A3 devices; **topo_affinity** for A2 and Atlas "
+"300 inference products."
+msgstr "对 A3 设备使用 **global_slice**;对 A2 和 Atlas 300 推理产品使用 **topo_affinity**。"

 #: ../../source/developer_guide/Design_Documents/cpu_binding.md:40
 msgid "If topo affinity is missing, fall back to global_slice."
@@ -597,6 +599,7 @@ msgid "Example 5: A2/310P topo_affinity with NUMA extension"
 msgstr "示例 5: 具有NUMA扩展的 A2/310P topo_affinity"

 #: ../../source/developer_guide/Design_Documents/cpu_binding.md:163
+#, python-brace-format
 msgid "npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)"
 msgstr "npu_affinity = {0: [0..7], 1: [0..7]} (来自 `npu-smi info -t topo`)"

@@ -748,7 +751,9 @@ msgid ""
 "global slicing yields 16 CPUs per NPU (0–15, 16–31, 32–47, 48–63), so "
 "each NPU’s pool stays within a single NUMA node."
 msgstr ""
-"示例(对称布局):2个NUMA节点,共64个CPU。NUMA0 = CPU 0–31,NUMA1 = CPU 32–63,cpuset为0–63。对于4个逻辑NPU,全局切片为每个NPU分配16个CPU (0–15, 16–31, 32–47, 48–63),因此每个NPU的CPU池都保持在单个NUMA节点内。"
+"示例(对称布局):2个NUMA节点,共64个CPU。NUMA0 = CPU 0–31,NUMA1 = CPU "
+"32–63,cpuset为0–63。对于4个逻辑NPU,全局切片为每个NPU分配16个CPU (0–15, 16–31, 32–47, "
+"48–63),因此每个NPU的CPU池都保持在单个NUMA节点内。"

 #: ../../source/developer_guide/Design_Documents/cpu_binding.md:212
 msgid "**Runtime dependencies**:"
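The symmetric-layout example in the hunk above (64 CPUs, 2 NUMA nodes, 4 logical NPUs, 16 CPUs per NPU) can be reproduced with a few lines. This is a sketch under stated assumptions only: `global_slice` here is a hypothetical stand-in that splits the cpuset into equal contiguous chunks, and the real binding code may treat remainders or asymmetric layouts differently.

```python
def global_slice(cpuset, num_npus):
    """Split an allowed CPU set into equal contiguous pools, one per NPU (hypothetical helper)."""
    if num_npus <= 0:
        raise ValueError("num_npus must be positive")
    cpus = sorted(cpuset)
    per_npu = len(cpus) // num_npus
    if per_npu == 0:
        raise ValueError("not enough CPUs for the requested number of NPUs")
    # Any leftover CPUs (len(cpus) % num_npus) are simply left unassigned in this sketch.
    return {npu: cpus[npu * per_npu:(npu + 1) * per_npu] for npu in range(num_npus)}


# Layout taken from the example text: NUMA0 = CPU 0-31, NUMA1 = CPU 32-63, cpuset 0-63.
pools = global_slice(range(64), 4)
numa_nodes = [set(range(0, 32)), set(range(32, 64))]
for npu, pool in pools.items():
    in_one_node = any(set(pool) <= node for node in numa_nodes)
    print(npu, f"{pool[0]}-{pool[-1]}", in_one_node)
```

With this layout every pool falls inside one NUMA node, which is the property the example text points out.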
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -32,9 +32,7 @@ msgid ""
 "This feature addresses the need to optimize the **Time Per Output Token "
 "(TPOT)** and **Time To First Token (TTFT)** in large-scale inference "
 "tasks. The motivation is two-fold:"
-msgstr ""
-"此功能旨在优化大规模推理任务中的**单输出令牌时间 (TPOT)** 和**首令牌时间 "
-"(TTFT)**。其动机主要有两方面:"
+msgstr "此功能旨在优化大规模推理任务中的**单输出令牌时间 (TPOT)** 和**首令牌时间 (TTFT)**。其动机主要有两方面:"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:7
 msgid ""
@@ -46,19 +44,22 @@ msgid ""
 "to better system performance tuning, particularly for **TTFT** and "
 "**TPOT**."
 msgstr ""
-"**调整 P 节点和 D 节点的并行策略与实例数量** 采用解耦式预填充策略,此功能允许系统灵活调整 P(预填充器)节点和 D(解码器)节点的并行化策略(例如数据并行 (dp)、张量并行 (tp) 和专家并行 (ep))以及实例数量。这有助于实现更好的系统性能调优,特别是针对 **TTFT** 和 **TPOT**。"
+"**调整 P 节点和 D 节点的并行策略与实例数量** 采用解耦式预填充策略,此功能允许系统灵活调整 P(预填充器)节点和 "
+"D(解码器)节点的并行化策略(例如数据并行 (dp)、张量并行 (tp) 和专家并行 "
+"(ep))以及实例数量。这有助于实现更好的系统性能调优,特别是针对 **TTFT** 和 **TPOT**。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:10
 msgid ""
 "**Optimizing TPOT** Without the disaggregated-prefill strategy, prefill "
 "tasks are inserted during decoding, which results in inefficiencies and "
 "delays. Disaggregated-prefill solves this by allowing for better control "
-"over the system’s **TPOT**. By managing chunked prefill tasks "
+"over the system's **TPOT**. By managing chunked prefill tasks "
 "effectively, the system avoids the challenge of determining the optimal "
 "chunk size and provides more reliable control over the time taken for "
 "generating output tokens."
 msgstr ""
-"**优化 TPOT** 在没有解耦式预填充策略的情况下,预填充任务会在解码过程中插入,导致效率低下和延迟。解耦式预填充通过允许更好地控制系统 **TPOT** 来解决此问题。通过有效管理分块的预填充任务,系统避免了确定最佳分块大小的挑战,并对生成输出令牌所需时间提供了更可靠的控制。"
+"**优化 TPOT** 在没有解耦式预填充策略的情况下,预填充任务会在解码过程中插入,导致效率低下和延迟。解耦式预填充通过允许更好地控制系统 "
+"**TPOT** 来解决此问题。通过有效管理分块的预填充任务,系统避免了确定最佳分块大小的挑战,并对生成输出令牌所需时间提供了更可靠的控制。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:15
 msgid "Usage"
@@ -101,10 +102,11 @@ msgstr "1. 设计思路"
 msgid ""
 "Under the disaggregated-prefill, a global proxy receives external "
 "requests, forwarding prefill to P nodes and decode to D nodes; the KV "
-"cache (key–value cache) is exchanged between P and D nodes via peer-to-"
+"cache (key-value cache) is exchanged between P and D nodes via peer-to-"
 "peer (P2P) communication."
 msgstr ""
-"在解耦式预填充架构下,一个全局代理接收外部请求,将预填充请求转发给 P 节点,将解码请求转发给 D 节点;KV 缓存(键值缓存)通过点对点 (P2P) 通信在 P 节点和 D 节点之间交换。"
+"在解耦式预填充架构下,一个全局代理接收外部请求,将预填充请求转发给 P 节点,将解码请求转发给 D 节点;KV 缓存(键值缓存)通过点对点 "
+"(P2P) 通信在 P 节点和 D 节点之间交换。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:33
 msgid "2. Implementation Design"
@@ -116,7 +118,9 @@ msgid ""
 " respectively.  "
 ""
 msgstr ""
-"我们的设计图如下所示,分别展示了拉取和推送方案。 "
+"我们的设计图如下所示,分别展示了拉取和推送方案。 "

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:35
 msgid "alt text"
@@ -128,7 +132,7 @@ msgstr "Mooncake 连接器"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:41
 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:49
-msgid "The request is sent to the Proxy’s `_handle_completions` endpoint."
+msgid "The request is sent to the Proxy's `_handle_completions` endpoint."
 msgstr "请求被发送到代理的 `_handle_completions` 端点。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:42
@@ -137,16 +141,18 @@ msgid ""
 "request, configuring `kv_transfer_params` with `do_remote_decode=True`, "
 "`max_completion_tokens=1`, and `min_tokens=1`."
 msgstr ""
-"代理调用 `select_prefiller` 选择一个 P 节点并转发请求,配置 `kv_transfer_params` 为 `do_remote_decode=True`、`max_completion_tokens=1` 和 `min_tokens=1`。"
+"代理调用 `select_prefiller` 选择一个 P 节点并转发请求,配置 `kv_transfer_params` 为 "
+"`do_remote_decode=True`、`max_completion_tokens=1` 和 `min_tokens=1`。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:43
 msgid ""
-"After the P node’s scheduler finishes prefill, `update_from_output` "
-"invokes the schedule connector’s `request_finished` to defer KV cache "
+"After the P node's scheduler finishes prefill, `update_from_output` "
+"invokes the schedule connector's `request_finished` to defer KV cache "
 "release, constructs `kv_transfer_params` with `do_remote_prefill=True`, "
 "and returns to the Proxy."
 msgstr ""
-"P 节点的调度器完成预填充后,`update_from_output` 调用调度连接器的 `request_finished` 以延迟释放 KV 缓存,构建 `kv_transfer_params` 为 `do_remote_prefill=True`,并返回给代理。"
+"P 节点的调度器完成预填充后,`update_from_output` 调用调度连接器的 `request_finished` 以延迟释放 KV "
+"缓存,构建 `kv_transfer_params` 为 `do_remote_prefill=True`,并返回给代理。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:44
 msgid ""
@@ -162,7 +168,8 @@ msgid ""
 "P node to release KV cache and proceeds with decoding to return the "
 "result."
 msgstr ""
-"在 D 节点上,调度器将请求标记为 `RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,调用 `kv_connector_no_forward` 拉取远程 KV 缓存,然后通知 P 节点释放 KV 缓存并继续解码以返回结果。"
+"在 D 节点上,调度器将请求标记为 `RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,调用 "
+"`kv_connector_no_forward` 拉取远程 KV 缓存,然后通知 P 节点释放 KV 缓存并继续解码以返回结果。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:47
 msgid "Mooncake Layerwise Connector"
@@ -174,7 +181,8 @@ msgid ""
 "request, configuring `kv_transfer_params` with `do_remote_prefill=True` "
 "and setting the `metaserver` endpoint."
 msgstr ""
-"代理调用 `select_decoder` 选择一个 D 节点并转发请求,配置 `kv_transfer_params` 为 `do_remote_prefill=True` 并设置 `metaserver` 端点。"
+"代理调用 `select_decoder` 选择一个 D 节点并转发请求,配置 `kv_transfer_params` 为 "
+"`do_remote_prefill=True` 并设置 `metaserver` 端点。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:51
 msgid ""
@@ -183,20 +191,24 @@ msgid ""
 "cache, then calls `kv_connector_no_forward` to send a request to the "
 "metaserver and waits for the KV cache transfer to complete."
 msgstr ""
-"在 D 节点上,调度器使用 `kv_transfer_params` 将请求标记为 `RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,然后调用 `kv_connector_no_forward` 向元服务器发送请求并等待 KV 缓存传输完成。"
+"在 D 节点上,调度器使用 `kv_transfer_params` 将请求标记为 "
+"`RequestStatus.WAITING_FOR_REMOTE_KVS`,预分配 KV 缓存,然后调用 "
+"`kv_connector_no_forward` 向元服务器发送请求并等待 KV 缓存传输完成。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:52
 msgid ""
-"The Proxy’s `metaserver` endpoint receives the request, calls "
+"The Proxy's `metaserver` endpoint receives the request, calls "
 "`select_prefiller` to choose a P node, and forwards it with "
 "`kv_transfer_params` set to `do_remote_decode=True`, "
 "`max_completion_tokens=1`, and `min_tokens=1`."
 msgstr ""
-"代理的 `metaserver` 端点接收请求,调用 `select_prefiller` 选择一个 P 节点,并转发请求,设置 `kv_transfer_params` 为 `do_remote_decode=True`、`max_completion_tokens=1` 和 `min_tokens=1`。"
+"代理的 `metaserver` 端点接收请求,调用 `select_prefiller` 选择一个 P 节点,并转发请求,设置 "
+"`kv_transfer_params` 为 `do_remote_decode=True`、`max_completion_tokens=1` "
+"和 `min_tokens=1`。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:53
 msgid ""
-"During processing, the P node’s scheduler pushes KV cache layer-wise; "
+"During processing, the P node's scheduler pushes KV cache layer-wise; "
 "once all layers pushing is complete, it releases the request and notifies"
 " the D node to begin decoding."
 msgstr "在处理过程中,P 节点的调度器逐层推送 KV 缓存;所有层推送完成后,它释放请求并通知 D 节点开始解码。"
@@ -240,10 +252,11 @@ msgstr "4. 规格设计"
 msgid ""
 "This feature is flexible and supports various configurations, including "
 "setups with MLA and GQA models. It is compatible with A2 and A3 hardware "
-"configurations and facilitates scenarios involving both equal and unequal"
-" TP setups across multiple P and D nodes."
+"configurations and facilitates scenarios involving equal TP setups and "
+"certain unequal TP setups across multiple P and D nodes."
 msgstr ""
-"此功能灵活,支持多种配置,包括使用 MLA 和 GQA 模型的设置。它与 A2 和 A3 硬件配置兼容,并支持跨多个 P 节点和 D 节点的相等和不相等 TP 设置场景。"
+"此功能灵活,支持多种配置,包括使用 MLA 和 GQA 模型的设置。它与 A2 和 A3 硬件配置兼容,并支持跨多个 P 节点和 D "
+"节点的相等和不相等 TP 设置场景。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md
 msgid "Feature"
@@ -317,7 +330,8 @@ msgid ""
 "supported and whether kv_connector_module_path exists and is loadable. On"
 " transfer failures, emit clear error logs for diagnostics."
 msgstr ""
-"通过检查 kv_connector 类型是否受支持以及 kv_connector_module_path 是否存在且可加载来验证 KV 传输配置。传输失败时,发出清晰的错误日志以供诊断。"
+"通过检查 kv_connector 类型是否受支持以及 kv_connector_module_path 是否存在且可加载来验证 KV "
+"传输配置。传输失败时,发出清晰的错误日志以供诊断。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:91
 msgid "2. Port Conflict Detection"
@@ -328,7 +342,9 @@ msgid ""
 "Before startup, perform a port-usage check on configured ports (e.g., "
 "rpc_port, metrics_port, http_port/metaserver) by attempting to bind. If a"
 " port is already in use, fail fast and log an error."
-msgstr "启动前,通过尝试绑定来对配置的端口(例如 rpc_port、metrics_port、http_port/metaserver)进行端口使用情况检查。如果端口已被占用,快速失败并记录错误。"
+msgstr ""
+"启动前,通过尝试绑定来对配置的端口(例如 "
+"rpc_port、metrics_port、http_port/metaserver)进行端口使用情况检查。如果端口已被占用,快速失败并记录错误。"

 #: ../../source/developer_guide/Design_Documents/disaggregated_prefill.md:95
 msgid "3. PD Ratio Validation"
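The port-conflict string in the hunk above describes a bind-based check that fails fast. A minimal sketch of that idea using only the standard `socket` module follows; `port_is_free` and `check_ports` are hypothetical helper names, the port labels are borrowed from the text purely as dictionary keys, and the loopback host is an assumption.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def check_ports(ports):
    """Fail fast with one error naming every configured port that is already in use."""
    busy = {name: port for name, port in ports.items() if not port_is_free(port)}
    if busy:
        raise RuntimeError(f"ports already in use: {busy}")


# Demo: hold an ephemeral port open, then show that the check rejects it.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
held_port = holder.getsockname()[1]
try:
    check_ports({"rpc_port": held_port})
except RuntimeError as err:
    print("startup aborted:", err)
finally:
    holder.close()
```

A bind check like this is inherently racy, since another process can take the port between the check and the real server start, so it works as an early diagnostic rather than a guarantee.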
@@ -357,4 +373,6 @@ msgid ""
 "higher TP degree than the D nodes and the P TP count is an integer "
 "multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % "
 "D_tp = 0)."
-msgstr "在非对称 TP 配置中,仅支持 P 节点的 TP 度数高于 D 节点且 P 节点的 TP 数量是 D 节点 TP 数量的整数倍的情况(即 P_tp > D_tp 且 P_tp % D_tp = 0)。"
+msgstr ""
+"在非对称 TP 配置中,仅支持 P 节点的 TP 度数高于 D 节点且 P 节点的 TP 数量是 D 节点 TP 数量的整数倍的情况(即 P_tp"
+" > D_tp 且 P_tp % D_tp = 0)。"
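The asymmetric-TP rule quoted in the hunk above (P_tp > D_tp and P_tp % D_tp = 0) is easy to state as a check. The sketch below assumes that equal TP sizes are also accepted, as the specification string earlier in this file suggests; `validate_tp` is a hypothetical name, not a function from vllm-ascend.

```python
def validate_tp(p_tp, d_tp):
    """Raise ValueError unless the prefill/decode TP sizes form a supported pair."""
    if p_tp <= 0 or d_tp <= 0:
        raise ValueError("TP sizes must be positive integers")
    if p_tp == d_tp:
        return  # symmetric setup
    if p_tp > d_tp and p_tp % d_tp == 0:
        return  # asymmetric setup: P_tp is a larger integer multiple of D_tp
    raise ValueError(
        f"unsupported TP pair: P_tp={p_tp}, D_tp={d_tp} "
        "(need P_tp > D_tp and P_tp % D_tp == 0)"
    )


for pair in [(8, 8), (8, 4), (4, 8), (6, 4)]:
    try:
        validate_tp(*pair)
        print(pair, "ok")
    except ValueError as err:
        print(pair, "rejected:", err)
```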
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -39,7 +39,10 @@ msgid ""
 "to place experts of the same group on the same node to reduce inter-node "
 "data traffic, whenever possible."
 msgstr ""
-"在使用专家并行 (EP) 时,不同的专家被分配到不同的 NPU 上。鉴于不同专家的负载可能因当前工作负载而异,保持不同 NPU 之间的负载均衡至关重要。我们采用冗余专家策略,通过复制高负载的专家来实现。然后,我们启发式地将这些复制的专家打包到 NPU 上,以确保它们之间的负载均衡。此外,得益于 MoE 模型中使用的组限制专家路由,我们也尽可能将同一组的专家放置在同一节点上,以减少节点间的数据流量。"
+"在使用专家并行 (EP) 时,不同的专家被分配到不同的 NPU 上。鉴于不同专家的负载可能因当前工作负载而异,保持不同 NPU "
+"之间的负载均衡至关重要。我们采用冗余专家策略,通过复制高负载的专家来实现。然后,我们启发式地将这些复制的专家打包到 NPU "
+"上,以确保它们之间的负载均衡。此外,得益于 MoE "
+"模型中使用的组限制专家路由,我们也尽可能将同一组的专家放置在同一节点上,以减少节点间的数据流量。"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:7
 msgid ""
@@ -50,7 +53,8 @@ msgid ""
 "predicting expert loads is outside the scope of this repository. A common"
 " method is to use a moving average of historical statistics."
 msgstr ""
-"为了方便复现和部署,vLLM Ascend 在 `vllm_ascend/eplb/core/policy` 中支持已部署的 EP 负载均衡算法。该算法根据估计的专家负载计算一个均衡的专家复制和放置计划。请注意,预测专家负载的具体方法不在本仓库的讨论范围内。一种常见的方法是使用历史统计数据的移动平均值。"
+"为了方便复现和部署,vLLM Ascend 在 `vllm_ascend/eplb/core/policy` 中支持已部署的 EP "
+"负载均衡算法。该算法根据估计的专家负载计算一个均衡的专家复制和放置计划。请注意,预测专家负载的具体方法不在本仓库的讨论范围内。一种常见的方法是使用历史统计数据的移动平均值。"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:9
 msgid ""
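The string above mentions a moving average of historical statistics as a common way to estimate expert load, and explicitly leaves the method outside the repository's scope. As one possible reading, here is a sketch of an exponential moving average over per-expert token counts; the function name, the `alpha` smoothing factor, and the sample numbers are all illustrative assumptions.

```python
def update_load_estimate(previous, observed, alpha=0.2):
    """Blend the newest per-expert load sample into a running estimate.

    previous, observed: equal-length sequences with one load value per expert.
    alpha: weight given to the newest sample (0 < alpha <= 1).
    """
    if len(previous) != len(observed):
        raise ValueError("previous and observed must cover the same experts")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return [(1 - alpha) * p + alpha * o for p, o in zip(previous, observed)]


estimate = [10.0, 0.0]
for sample in ([20.0, 10.0], [20.0, 10.0]):
    estimate = update_load_estimate(estimate, sample, alpha=0.5)
print(estimate)  # [17.5, 7.5]
```

A smaller `alpha` reacts more slowly to workload shifts but yields a steadier input for the placement algorithm.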
@@ -214,7 +218,8 @@ msgid ""
 "hierarchical load balancing policy can be used in the prefilling stage "
 "with a smaller expert-parallel size."
 msgstr ""
-"当服务器节点数量能整除专家组数量时,我们使用分层负载均衡策略来利用组限制专家路由。我们首先将专家组均匀地打包到节点上,确保不同节点间的负载均衡。然后,我们在每个节点内复制专家。最后,我们将复制的专家打包到各个 NPU 上,以确保它们之间的负载均衡。分层负载均衡策略可以在预填充阶段使用,此时专家并行规模较小。"
+"当服务器节点数量能整除专家组数量时,我们使用分层负载均衡策略来利用组限制专家路由。我们首先将专家组均匀地打包到节点上,确保不同节点间的负载均衡。然后,我们在每个节点内复制专家。最后,我们将复制的专家打包到各个"
+" NPU 上,以确保它们之间的负载均衡。分层负载均衡策略可以在预填充阶段使用,此时专家并行规模较小。"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:92
 msgid "Global Load Balancing"
@@ -227,7 +232,8 @@ msgid ""
 "experts onto individual NPUs. This policy can be adopted in the decoding "
 "stage with a larger expert-parallel size."
 msgstr ""
-"在其他情况下,我们使用全局负载均衡策略,该策略不考虑专家组,而是在全局范围内复制专家,并将复制的专家打包到各个 NPU 上。此策略可以在解码阶段采用,此时专家并行规模较大。"
+"在其他情况下,我们使用全局负载均衡策略,该策略不考虑专家组,而是在全局范围内复制专家,并将复制的专家打包到各个 NPU "
+"上。此策略可以在解码阶段采用,此时专家并行规模较大。"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:96
 msgid "Add a New EPLB Policy"
@@ -246,8 +252,9 @@ msgid ""
 "parameters `current_expert_table`, `expert_workload` and return types "
 "`newplacement`. For example:"
 msgstr ""
-"继承 `policy_abstract.py` 中的 `EplbPolicy` 抽象类,并重写 `rebalance_experts` 接口,确保输入参数 "
-"`current_expert_table`、`expert_workload` 和返回类型 `newplacement` 保持一致。例如:"
+"继承 `policy_abstract.py` 中的 `EplbPolicy` 抽象类,并重写 `rebalance_experts` "
+"接口,确保输入参数 `current_expert_table`、`expert_workload` 和返回类型 `newplacement` "
+"保持一致。例如:"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:126
 msgid ""
@@ -380,7 +387,8 @@ msgid ""
 "minimum values and be subject to valid value validation. For example, "
 "`expert_heat_collection_interval` must be greater than 0:"
 msgstr ""
-"所有整型输入参数必须明确指定其最大值和最小值,并接受有效值验证。例如,`expert_heat_collection_interval` 必须大于0:"
+"所有整型输入参数必须明确指定其最大值和最小值,并接受有效值验证。例如,`expert_heat_collection_interval` "
+"必须大于0:"

 #: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:197
 msgid "File Path"
@@ -419,28 +427,27 @@ msgid ""
|
||||
"function body, specifying the type of exception captured and the failure "
|
||||
"handling (e.g., logging exceptions or returning a failure status)."
|
||||
msgstr ""
|
||||
"所有方法参数必须指定参数类型和默认值,并且函数必须包含针对默认参数的默认返回值处理。建议使用 `try-except` 块来处理函数体,指定捕获的异常类型和失败处理(例如,记录异常或返回失败状态)。"
|
||||
"所有方法参数必须指定参数类型和默认值,并且函数必须包含针对默认参数的默认返回值处理。建议使用 `try-except` "
|
||||
"块来处理函数体,指定捕获的异常类型和失败处理(例如,记录异常或返回失败状态)。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:235
|
||||
msgid "Consistency"
|
||||
msgstr "一致性"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:237
|
||||
msgid "Expert Map"
|
||||
msgstr "专家映射"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:239
|
||||
msgid ""
|
||||
"The expert map must be globally unique during initialization and update. "
|
||||
"In a multi-node scenario during initialization, distributed communication"
|
||||
" should be used to verify the consistency of expert maps across each "
|
||||
"rank. If they are inconsistent, the user should be notified which ranks "
|
||||
"have inconsistent maps. During the update process, if only a few layers "
|
||||
"or the expert table of a certain rank has been changed, the updated "
|
||||
"expert table must be synchronized with the EPLB's context to ensure "
|
||||
"global consistency."
|
||||
"rank. If they are inconsistent, the user should be notified of which "
|
||||
"ranks have inconsistent maps. During the update process, if only a few "
|
||||
"layers or the expert table of a certain rank has been changed, the "
|
||||
"updated expert table must be synchronized with the EPLB's context to "
|
||||
"ensure global consistency."
|
||||
msgstr ""
|
||||
"专家映射在初始化和更新期间必须是全局唯一的。在初始化期间的多节点场景中,应使用分布式通信来验证每个 rank 上专家映射的一致性。如果不一致,应通知用户哪些 rank 的映射不一致。在更新过程中,如果只有少数层或某个 rank 的专家表被更改,则必须将更新后的专家表与 EPLB 的上下文同步,以确保全局一致性。"
|
||||
"专家映射在初始化和更新期间必须是全局唯一的。在初始化期间的多节点场景中,应使用分布式通信来验证每个 rank "
|
||||
"上专家映射的一致性。如果不一致,应通知用户哪些 rank 的映射不一致。在更新过程中,如果只有少数层或某个 rank "
|
||||
"的专家表被更改,则必须将更新后的专家表与 EPLB 的上下文同步,以确保全局一致性。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/eplb_swift_balancer.md:242
|
||||
msgid "Expert Weight"
|
||||
@@ -464,4 +471,6 @@ msgid ""
|
||||
"performance data collection), start the script and add `export "
|
||||
"EXPERT_MAP_RECORD=\"true\"`."
|
||||
msgstr ""
|
||||
"在使用 EPLB 之前,启动脚本并添加 `export DYNAMIC_EPLB=\"true\"`。在执行负载数据收集(或性能数据收集)之前,启动脚本并添加 `export EXPERT_MAP_RECORD=\"true\"`。"
|
||||
"在使用 EPLB 之前,启动脚本并添加 `export "
|
||||
"DYNAMIC_EPLB=\"true\"`。在执行负载数据收集(或性能数据收集)之前,启动脚本并添加 `export "
|
||||
"EXPERT_MAP_RECORD=\"true\"`。"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -29,21 +29,21 @@ msgstr "工作原理"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:5
 msgid ""
-"This is an optimization based on Fx graphs, which can be considered an "
+"This is an optimization based on FX graphs, which can be considered an "
 "acceleration solution for the aclgraph mode."
-msgstr "这是一种基于 Fx 图的优化,可视为 aclgraph 模式的一种加速方案。"
+msgstr "这是一种基于 FX 图的优化,可视为 aclgraph 模式的一种加速方案。"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:7
 msgid "You can get its code [code](https://gitcode.com/Ascend/torchair)"
 msgstr "您可以在 [code](https://gitcode.com/Ascend/torchair) 获取其代码"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:9
-msgid "Default Fx Graph Optimization"
-msgstr "默认 Fx 图优化"
+msgid "Default FX Graph Optimization"
+msgstr "默认 FX 图优化"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:11
-msgid "Fx Graph pass"
-msgstr "Fx 图处理过程"
+msgid "FX Graph pass"
+msgstr "FX 图处理过程"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:13
 msgid ""
@@ -59,11 +59,13 @@ msgid ""
 "operators with a form of non-in-place operators + copy operators. "
 "npugraph_ex will reverse this process, restoring the in-place operators "
 "and reducing memory movement."
-msgstr "对于模型的原始输入参数,如果包含原位运算符,Dynamo 的 Functionalize 过程会将其替换为非原位运算符 + 复制运算符的形式。npugraph_ex 将逆转此过程,恢复原位运算符,减少内存移动。"
+msgstr ""
+"对于模型的原始输入参数,如果包含原位运算符,Dynamo 的 Functionalize 过程会将其替换为非原位运算符 + "
+"复制运算符的形式。npugraph_ex 将逆转此过程,恢复原位运算符,减少内存移动。"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:16
-msgid "Fx fusion pass"
-msgstr "Fx 融合处理过程"
+msgid "FX fusion pass"
+msgstr "FX 融合处理过程"
 
 #: ../../source/developer_guide/Design_Documents/npugraph_ex.md:18
 msgid ""
@@ -92,7 +94,9 @@ msgid ""
|
||||
"Users can register a custom graph fusion pass in TorchAir to modify "
|
||||
"PyTorch FX graphs. The registration relies on the register_replacement "
|
||||
"API."
|
||||
msgstr "用户可以在 TorchAir 中注册自定义的图融合处理过程,以修改 PyTorch FX 图。注册依赖于 register_replacement API。"
|
||||
msgstr ""
|
||||
"用户可以在 TorchAir 中注册自定义的图融合处理过程,以修改 PyTorch FX 图。注册依赖于 register_replacement"
|
||||
" API。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:28
|
||||
msgid "Below is the declaration of this API and a demo of its usage."
|
||||
@@ -182,7 +186,9 @@ msgid ""
|
||||
" on the matching result, such as checking whether the fused operators are"
|
||||
" on the same stream, checking the device type, checking the input shapes,"
|
||||
" and so on."
|
||||
msgstr "算子融合后的额外验证函数。该函数的输入参数必须是来自 torch._inductor.pattern_matcher 的 Match 对象,用于对匹配结果进行进一步的自定义检查,例如检查融合后的算子是否在同一流上、检查设备类型、检查输入形状等。"
|
||||
msgstr ""
|
||||
"算子融合后的额外验证函数。该函数的输入参数必须是来自 torch._inductor.pattern_matcher 的 Match "
|
||||
"对象,用于对匹配结果进行进一步的自定义检查,例如检查融合后的算子是否在同一流上、检查设备类型、检查输入形状等。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md
|
||||
msgid "search_fn_pattern"
|
||||
@@ -195,7 +201,9 @@ msgid ""
|
||||
"object. After passing this parameter, search_fn will no longer be used to"
|
||||
" match operator combinations; instead, this parameter will be used "
|
||||
"directly as the matching rule."
|
||||
msgstr "通常无需提供自定义模式对象。其定义遵循原生 PyTorch MultiOutputPattern 对象的规则。传入此参数后,将不再使用 search_fn 来匹配算子组合,而是直接使用此参数作为匹配规则。"
|
||||
msgstr ""
|
||||
"通常无需提供自定义模式对象。其定义遵循原生 PyTorch MultiOutputPattern 对象的规则。传入此参数后,将不再使用 "
|
||||
"search_fn 来匹配算子组合,而是直接使用此参数作为匹配规则。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:43
|
||||
msgid "Usage Example"
|
||||
@@ -206,7 +214,9 @@ msgid ""
|
||||
"The default fusion pass in npugraph_ex is also implemented based on this "
|
||||
"API. You can see more examples of using this API in the vllm-ascend and "
|
||||
"npugraph_ex code repositories."
|
||||
msgstr "npugraph_ex 中的默认融合处理过程也是基于此 API 实现的。您可以在 vllm-ascend 和 npugraph_ex 代码仓库中查看更多使用此 API 的示例。"
|
||||
msgstr ""
|
||||
"npugraph_ex 中的默认融合处理过程也是基于此 API 实现的。您可以在 vllm-ascend 和 npugraph_ex "
|
||||
"代码仓库中查看更多使用此 API 的示例。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/npugraph_ex.md:99
|
||||
msgid "DFX"
|
||||
@@ -217,4 +227,6 @@ msgid ""
|
||||
"By reusing the TORCH_COMPILE_DEBUG environment variable from the PyTorch "
|
||||
"community, when TORCH_COMPILE_DEBUG=1 is set, it will output the FX "
|
||||
"graphs throughout the entire process."
|
||||
msgstr "通过复用 PyTorch 社区的 TORCH_COMPILE_DEBUG 环境变量,当设置 TORCH_COMPILE_DEBUG=1 时,将输出整个过程中的 FX 图。"
|
||||
msgstr ""
|
||||
"通过复用 PyTorch 社区的 TORCH_COMPILE_DEBUG 环境变量,当设置 TORCH_COMPILE_DEBUG=1 "
|
||||
"时,将输出整个过程中的 FX 图。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -29,8 +29,8 @@ msgid ""
|
||||
"cycle of vLLM and vLLM Ascend and their hardware limitations, we need to "
|
||||
"patch some code in vLLM to make it compatible with vLLM Ascend."
|
||||
msgstr ""
|
||||
"vLLM Ascend 是 vLLM 的一个平台插件。由于 vLLM 和 vLLM Ascend "
|
||||
"的发布周期不同且存在硬件限制,我们需要对 vLLM 中的部分代码打补丁,以使其兼容 vLLM Ascend。"
|
||||
"vLLM Ascend 是 vLLM 的一个平台插件。由于 vLLM 和 vLLM Ascend 的发布周期不同且存在硬件限制,我们需要对 "
|
||||
"vLLM 中的部分代码打补丁,以使其兼容 vLLM Ascend。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:5
|
||||
msgid ""
|
||||
@@ -121,7 +121,8 @@ msgid ""
|
||||
"initializing the worker process."
|
||||
msgstr ""
|
||||
"对于在线和离线模式,vLLM 引擎核心进程在初始化 worker 进程时,会在 "
|
||||
"`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` 处调用 worker 补丁。"
|
||||
"`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` 处调用 "
|
||||
"worker 补丁。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:35
|
||||
msgid "How to write a patch"
|
||||
@@ -150,6 +151,7 @@ msgid ""
|
||||
msgstr "确定我们需要修补哪个进程。例如,这里的 `distributed` 属于 vLLM 主进程,因此我们应该修补 `platform`。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:41
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"Create the patch file in the right folder. The file should be named as "
|
||||
"`patch_{module_name}.py`. The example here is "
|
||||
@@ -169,7 +171,8 @@ msgid ""
|
||||
"`vllm_ascend/patch/platform/__init__.py`."
|
||||
msgstr ""
|
||||
"在 `__init__.py` 中导入补丁文件。在此示例中,将 `import "
|
||||
"vllm_ascend.patch.platform.patch_distributed` 添加到 `vllm_ascend/patch/platform/__init__.py` 中。"
|
||||
"vllm_ascend.patch.platform.patch_distributed` 添加到 "
|
||||
"`vllm_ascend/patch/platform/__init__.py` 中。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:55
|
||||
msgid ""
|
||||
@@ -183,8 +186,8 @@ msgid ""
|
||||
"should contain the Unit Test and E2E Test as well. You can find more "
|
||||
"details in [test guide](../contribution/testing.md)"
|
||||
msgstr ""
|
||||
"添加单元测试和端到端测试。vLLM Ascend 中任何新增的代码都应包含单元测试和端到端测试。更多详情请参阅 [测试指南]"
|
||||
"(../contribution/testing.md)。"
|
||||
"添加单元测试和端到端测试。vLLM Ascend 中任何新增的代码都应包含单元测试和端到端测试。更多详情请参阅 "
|
||||
"[测试指南](../contribution/testing.md)。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:73
|
||||
msgid "Limitations"
|
||||
@@ -201,8 +204,9 @@ msgid ""
|
||||
"`DPEngineCoreProc` entirely."
|
||||
msgstr ""
|
||||
"在 V1 引擎中,vLLM 启动三种进程:主进程、EngineCore 进程和 Worker 进程。目前 vLLM Ascend "
|
||||
"默认只能修补主进程和 Worker 进程中的代码。如果你想修补 EngineCore 进程中运行的代码,你需要在设置阶段完全修补 EngineCore "
|
||||
"进程。相关完整代码位于 `vllm.v1.engine.core`。请完全重写 `EngineCoreProc` 和 `DPEngineCoreProc`。"
|
||||
"默认只能修补主进程和 Worker 进程中的代码。如果你想修补 EngineCore 进程中运行的代码,你需要在设置阶段完全修补 "
|
||||
"EngineCore 进程。相关完整代码位于 `vllm.v1.engine.core`。请完全重写 `EngineCoreProc` 和 "
|
||||
"`DPEngineCoreProc`。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/patch.md:76
|
||||
msgid ""
|
||||
@@ -212,10 +216,10 @@ msgid ""
 "for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend"
 " can't distinguish the version of the vLLM you're using. In this case, "
 "you can set the environment variable `VLLM_VERSION` to specify the "
-"version of the vLLM you're using, and then the patch for v0.10.0 should "
-"work."
+"version of the vLLM you're using, and then the patch for that version "
+"(e.g., v0.9.n) should work."
 msgstr ""
 "如果你运行的是经过编辑的 vLLM 代码,vLLM 的版本可能会自动更改。例如,如果你基于 v0.9.n 运行编辑后的 vLLM,vLLM "
-"的版本可能会变为 v0.9.nxxx。在这种情况下,vLLM Ascend 中针对 v0.9.n 的补丁将无法按预期工作,因为 vLLM Ascend "
-"无法区分你正在使用的 vLLM 版本。此时,你可以设置环境变量 `VLLM_VERSION` 来指定你使用的 vLLM 版本,这样针对 v0.10.0 "
-"的补丁就应该能正常工作了。"
+"的版本可能会变为 v0.9.nxxx。在这种情况下,vLLM Ascend 中针对 v0.9.n 的补丁将无法按预期工作,因为 vLLM "
+"Ascend 无法区分你正在使用的 vLLM 版本。此时,你可以设置环境变量 `VLLM_VERSION` 来指定你使用的 vLLM "
+"版本,这样针对该版本(例如 v0.9.n)的补丁就应该能正常工作了。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -60,14 +60,21 @@ msgid ""
|
||||
" `get_quant_method` is called to obtain the quantization method "
|
||||
"corresponding to each weight part, stored in the `quant_method` "
|
||||
"attribute."
|
||||
msgstr "vLLM Ascend 注册了一个自定义的 Ascend 量化方法。通过配置 `--quantization ascend` 参数(或离线时使用 `quantization=\"ascend\"`),即可启用量化功能。在构建 `quant_config` 时,会初始化已注册的 `AscendModelSlimConfig`,并调用 `get_quant_method` 来获取每个权重部分对应的量化方法,存储在 `quant_method` 属性中。"
|
||||
msgstr ""
|
||||
"vLLM Ascend 注册了一个自定义的 Ascend 量化方法。通过配置 `--quantization ascend` 参数(或离线时使用 "
|
||||
"`quantization=\"ascend\"`),即可启用量化功能。在构建 `quant_config` 时,会初始化已注册的 "
|
||||
"`AscendModelSlimConfig`,并调用 `get_quant_method` 来获取每个权重部分对应的量化方法,存储在 "
|
||||
"`quant_method` 属性中。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:15
|
||||
msgid ""
|
||||
"Currently supported quantization methods include `AscendLinearMethod`, "
|
||||
"`AscendFusedMoEMethod`, `AscendEmbeddingMethod`, and their corresponding "
|
||||
"non-quantized methods:"
|
||||
msgstr "当前支持的量化方法包括 `AscendLinearMethod`、`AscendFusedMoEMethod`、`AscendEmbeddingMethod` 及其对应的非量化方法:"
|
||||
msgstr ""
|
||||
"当前支持的量化方法包括 "
|
||||
"`AscendLinearMethod`、`AscendFusedMoEMethod`、`AscendEmbeddingMethod` "
|
||||
"及其对应的非量化方法:"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:17
|
||||
msgid ""
|
||||
@@ -105,14 +112,18 @@ msgid ""
 "conversion, etc.; the `apply` method is used to perform activation "
 "quantization and quantized matrix multiplication calculations during the "
 "forward process."
-msgstr "`create_weights` 方法用于权重初始化;`process_weights_after_loading` 方法用于权重后处理,例如转置、格式转换、数据类型转换等;`apply` 方法用于在前向传播过程中执行激活量化和量化矩阵乘法计算。"
+msgstr ""
+"`create_weights` 方法用于权重初始化;`process_weights_after_loading` "
+"方法用于权重后处理,例如转置、格式转换、数据类型转换等;`apply` 方法用于在前向传播过程中执行激活量化和量化矩阵乘法计算。"
 
 #: ../../source/developer_guide/Design_Documents/quantization.md:27
 msgid ""
 "We need to implement the `create_weights`, "
 "`process_weights_after_loading`, and `apply` methods for different "
-"**layers** (**attention**, **mlp**, **moe**)."
-msgstr "我们需要为不同的**层**(**attention**、**mlp**、**moe**)实现 `create_weights`、`process_weights_after_loading` 和 `apply` 方法。"
+"**layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**)."
+msgstr ""
+"我们需要为不同的**层**(**attention**、**mlp**、**MoE (Mixture of Experts)**)实现 "
+"`create_weights`、`process_weights_after_loading` 和 `apply` 方法。"
 
 #: ../../source/developer_guide/Design_Documents/quantization.md:29
 msgid ""
@@ -120,7 +131,9 @@ msgid ""
|
||||
" file **quant_model_description.json** needs to be read. This file "
|
||||
"describes the quantization configuration and parameters for each part of "
|
||||
"the model weights, for example:"
|
||||
msgstr "**补充说明**:加载模型时,需要读取量化模型的描述文件 **quant_model_description.json**。该文件描述了模型各部分权重的量化配置和参数,例如:"
|
||||
msgstr ""
|
||||
"**补充说明**:加载模型时,需要读取量化模型的描述文件 "
|
||||
"**quant_model_description.json**。该文件描述了模型各部分权重的量化配置和参数,例如:"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:49
|
||||
msgid ""
|
||||
@@ -138,21 +151,27 @@ msgid ""
|
||||
"`W4A8_DYNAMIC`), determine supported layers (linear, moe, attention), and"
|
||||
" design the quantization scheme (static/dynamic, "
|
||||
"pertensor/perchannel/pergroup)."
|
||||
msgstr "**步骤 1:算法设计**。定义算法 ID(例如 `W4A8_DYNAMIC`),确定支持的层(linear、moe、attention),并设计量化方案(静态/动态、pertensor/perchannel/pergroup)。"
|
||||
msgstr ""
|
||||
"**步骤 1:算法设计**。定义算法 ID(例如 "
|
||||
"`W4A8_DYNAMIC`),确定支持的层(linear、moe、attention),并设计量化方案(静态/动态、pertensor/perchannel/pergroup)。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:54
|
||||
msgid ""
|
||||
"**Step 2: Registration**. Use the `@register_scheme` decorator in "
|
||||
"`vllm_ascend/quantization/methods/registry.py` to register your "
|
||||
"quantization scheme class."
|
||||
msgstr "**步骤 2:注册**。在 `vllm_ascend/quantization/methods/registry.py` 中使用 `@register_scheme` 装饰器注册您的量化方案类。"
|
||||
msgstr ""
|
||||
"**步骤 2:注册**。在 `vllm_ascend/quantization/methods/registry.py` 中使用 "
|
||||
"`@register_scheme` 装饰器注册您的量化方案类。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:68
|
||||
msgid ""
|
||||
"**Step 3: Implementation**. Create an algorithm implementation file, such"
|
||||
" as `vllm_ascend/quantization/methods/w4a8.py`, and implement the method "
|
||||
"class and logic."
|
||||
msgstr "**步骤 3:实现**。创建一个算法实现文件,例如 `vllm_ascend/quantization/methods/w4a8.py`,并实现方法类和逻辑。"
|
||||
msgstr ""
|
||||
"**步骤 3:实现**。创建一个算法实现文件,例如 "
|
||||
"`vllm_ascend/quantization/methods/w4a8.py`,并实现方法类和逻辑。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:69
|
||||
msgid ""
|
||||
@@ -182,7 +201,11 @@ msgid ""
|
||||
"`vllm_ascend/quantization/modelslim_config.py` (e.g., `qkv_proj`, "
|
||||
"`gate_up_proj`, `experts`) to ensure sharding consistency and correct "
|
||||
"loading."
|
||||
msgstr "**融合模块映射**:将模型的 `model_type` 添加到 `vllm_ascend/quantization/modelslim_config.py` 中的 `packed_modules_model_mapping`(例如 `qkv_proj`、`gate_up_proj`、`experts`),以确保分片一致性和正确加载。"
|
||||
msgstr ""
|
||||
"**融合模块映射**:将模型的 `model_type` 添加到 "
|
||||
"`vllm_ascend/quantization/modelslim_config.py` 中的 "
|
||||
"`packed_modules_model_mapping`(例如 "
|
||||
"`qkv_proj`、`gate_up_proj`、`experts`),以确保分片一致性和正确加载。"
|
||||
|
||||
#: ../../source/developer_guide/Design_Documents/quantization.md:96
|
||||
msgid ""
|
||||
@@ -339,10 +362,9 @@ msgstr "混合"
 
 #: ../../source/developer_guide/Design_Documents/quantization.md
 msgid ""
-"PD Colocation Scenario uses dynamic quantization for both P node and D "
-"node; PD Disaggregation Scenario uses dynamic quantization for P node and"
-" static for D node"
-msgstr "PD 共部署场景下,P节点和D节点均使用动态量化;PD 分离部署场景下,P节点使用动态量化,D节点使用静态量化"
+"We support two deployment modes: PD Colocation (dynamic quantization for "
+"both P and D) and PD Disaggregation (dynamic-quant P and static-quant D)"
+msgstr "我们支持两种部署模式:PD 共部署(P和D均使用动态量化)和 PD 分离部署(P使用动态量化,D使用静态量化)"
 
 #: ../../source/developer_guide/Design_Documents/quantization.md:112
 msgid ""
@@ -350,10 +372,10 @@ msgid ""
 "factors with better performance, while dynamic quantization computes "
 "scaling factors on-the-fly for each token/activation tensor with higher "
 "precision."
-msgstr "**静态与动态:** 静态量化使用预计算的缩放因子,性能更好;而动态量化则为每个 token/激活张量实时计算缩放因子,精度更高。"
+msgstr "**静态与动态:** 静态量化使用预计算的缩放因子,性能更优;而动态量化则为每个 token/激活张量实时计算缩放因子,精度更高。"
 
 #: ../../source/developer_guide/Design_Documents/quantization.md:114
 msgid ""
 "**Granularity:** Refers to the scope of scaling factor computation (e.g.,"
 " per-tensor, per-channel, per-group)."
-msgstr "**粒度:** 指缩放因子计算的范围(例如,per-tensor、per-channel、per-group)。"
+msgstr "**粒度:** 指缩放因子计算的范围(例如,per-tensor、per-channel、per-group)。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -19,7 +19,7 @@ msgstr ""
 "Content-Transfer-Encoding: 8bit\n"
 "Generated-By: Babel 2.18.0\n"
 
-#: ../../source/developer_guide/contribution/index.md:108
+#: ../../source/developer_guide/contribution/index.md:107
 msgid "Index"
 msgstr "索引"
 
@@ -62,120 +62,124 @@ msgid "Run CI locally"
 msgstr "本地运行 CI"
 
 #: ../../source/developer_guide/contribution/index.md:37
-msgid "After completing \"Run lint\" setup, you can run CI locally:"
-msgstr "完成“运行代码检查”设置后,你可以在本地运行 CI:"
+msgid ""
+"After completing \"Run lint\" setup, you can run CI (Continuous "
+"integration) locally:"
+msgstr "完成“运行代码检查”设置后,你可以在本地运行 CI(持续集成):"
 
-#: ../../source/developer_guide/contribution/index.md:63
+#: ../../source/developer_guide/contribution/index.md:62
 msgid "Submit the commit"
 msgstr "提交更改"
 
-#: ../../source/developer_guide/contribution/index.md:70
+#: ../../source/developer_guide/contribution/index.md:69
 msgid "🎉 Congratulations! You have completed the development environment setup."
 msgstr "🎉 恭喜!您已完成开发环境的设置。"
 
-#: ../../source/developer_guide/contribution/index.md:72
+#: ../../source/developer_guide/contribution/index.md:71
 msgid "Testing locally"
 msgstr "本地测试"
 
-#: ../../source/developer_guide/contribution/index.md:74
+#: ../../source/developer_guide/contribution/index.md:73
 msgid ""
 "You can refer to [Testing](./testing.md) to set up a testing environment"
 " and running tests locally."
 msgstr "你可以参考 [测试](./testing.md) 文档来设置测试环境并在本地运行测试。"
 
-#: ../../source/developer_guide/contribution/index.md:76
+#: ../../source/developer_guide/contribution/index.md:75
 msgid "DCO and Signed-off-by"
 msgstr "DCO 与签署确认"
 
-#: ../../source/developer_guide/contribution/index.md:78
+#: ../../source/developer_guide/contribution/index.md:77
 msgid ""
 "When contributing changes to this project, you must agree to the DCO. "
 "Commits must include a `Signed-off-by:` header which certifies agreement "
-"with the terms of the DCO."
-msgstr "向本项目贡献更改时,您必须同意 DCO。提交必须包含 `Signed-off-by:` 标头,以证明您同意 DCO 的条款。"
+"with the terms of the DCO (Developer Certificate of Origin)."
+msgstr "向本项目贡献更改时,您必须同意 DCO。提交必须包含 `Signed-off-by:` 标头,以证明您同意 DCO(开发者原创证书)的条款。"
 
-#: ../../source/developer_guide/contribution/index.md:80
+#: ../../source/developer_guide/contribution/index.md:79
 msgid "Using `-s` with `git commit` will automatically add this header."
 msgstr "在 `git commit` 命令中使用 `-s` 参数会自动添加此标头。"
 
-#: ../../source/developer_guide/contribution/index.md:82
+#: ../../source/developer_guide/contribution/index.md:81
 msgid "PR Title and Classification"
 msgstr "PR 标题与分类"
 
-#: ../../source/developer_guide/contribution/index.md:84
+#: ../../source/developer_guide/contribution/index.md:83
 msgid ""
 "Only specific types of PRs will be reviewed. The PR title is prefixed "
 "appropriately to indicate the type of change. Please use one of the "
 "following:"
 msgstr "只有特定类型的 PR 会被审核。PR 标题应使用适当的前缀来指明更改类型。请使用以下前缀之一:"
 
-#: ../../source/developer_guide/contribution/index.md:86
+#: ../../source/developer_guide/contribution/index.md:85
 msgid "`[Attention]` for new features or optimization in attention."
 msgstr "`[Attention]` 用于注意力机制的新功能或优化。"
 
-#: ../../source/developer_guide/contribution/index.md:87
+#: ../../source/developer_guide/contribution/index.md:86
 msgid "`[Communicator]` for new features or optimization in communicators."
 msgstr "`[Communicator]` 用于通信器的新功能或优化。"
 
-#: ../../source/developer_guide/contribution/index.md:88
+#: ../../source/developer_guide/contribution/index.md:87
 msgid "`[ModelRunner]` for new features or optimization in model runner."
 msgstr "`[ModelRunner]` 用于模型运行器的新功能或优化。"
 
-#: ../../source/developer_guide/contribution/index.md:89
+#: ../../source/developer_guide/contribution/index.md:88
 msgid "`[Platform]` for new features or optimization in platform."
 msgstr "`[Platform]` 用于平台的新功能或优化。"
 
-#: ../../source/developer_guide/contribution/index.md:90
+#: ../../source/developer_guide/contribution/index.md:89
 msgid "`[Worker]` for new features or optimization in worker."
 msgstr "`[Worker]` 用于工作器的新功能或优化。"
 
-#: ../../source/developer_guide/contribution/index.md:91
+#: ../../source/developer_guide/contribution/index.md:90
 msgid ""
 "`[Core]` for new features or optimization in the core vllm-ascend logic "
 "(such as platform, attention, communicators, model runner)"
 msgstr "`[Core]` 用于核心 vllm-ascend 逻辑中的新功能或优化(例如平台、注意力机制、通信器、模型运行器)。"
 
-#: ../../source/developer_guide/contribution/index.md:92
+#: ../../source/developer_guide/contribution/index.md:91
 msgid "`[Kernel]` for changes affecting compute kernels and ops."
 msgstr "`[Kernel]` 用于影响计算内核和操作的更改。"
 
-#: ../../source/developer_guide/contribution/index.md:93
+#: ../../source/developer_guide/contribution/index.md:92
 msgid "`[Bugfix]` for bug fixes."
 msgstr "`[Bugfix]` 用于错误修复。"
 
-#: ../../source/developer_guide/contribution/index.md:94
+#: ../../source/developer_guide/contribution/index.md:93
 msgid "`[Doc]` for documentation fixes and improvements."
 msgstr "`[Doc]` 用于文档修复和改进。"
 
-#: ../../source/developer_guide/contribution/index.md:95
+#: ../../source/developer_guide/contribution/index.md:94
 msgid "`[Test]` for tests (such as unit tests)."
 msgstr "`[Test]` 用于测试(例如单元测试)。"
 
-#: ../../source/developer_guide/contribution/index.md:96
+#: ../../source/developer_guide/contribution/index.md:95
 msgid "`[CI]` for build or continuous integration improvements."
 msgstr "`[CI]` 用于构建或持续集成的改进。"
 
-#: ../../source/developer_guide/contribution/index.md:97
+#: ../../source/developer_guide/contribution/index.md:96
 msgid ""
 "`[Misc]` for PRs that do not fit the above categories. Please use this "
 "sparingly."
 msgstr "`[Misc]` 用于不属于上述类别的 PR。请谨慎使用此标签。"
 
-#: ../../source/developer_guide/contribution/index.md:100
+#: ../../source/developer_guide/contribution/index.md:99
 msgid ""
 "If the PR spans more than one category, please include all relevant "
 "prefixes."
 msgstr "如果 PR 涉及多个类别,请包含所有相关的前缀。"
 
-#: ../../source/developer_guide/contribution/index.md:103
+#: ../../source/developer_guide/contribution/index.md:102
 msgid "Others"
 msgstr "其他"
 
-#: ../../source/developer_guide/contribution/index.md:105
+#: ../../source/developer_guide/contribution/index.md:104
 msgid ""
 "You may find more information about contributing to vLLM Ascend backend "
 "plugin on "
 "[<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing). If "
 "you encounter any problems while contributing, feel free to submit a PR "
 "to improve the documentation to help other developers."
-msgstr "你可以在 [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing) 上找到有关为 vLLM Ascend 后端插件做贡献的更多信息。如果在贡献过程中遇到任何问题,欢迎随时提交 PR 来改进文档,以帮助其他开发者。"
+msgstr ""
+"你可以在 [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing) "
+"上找到有关为 vLLM Ascend 后端插件做贡献的更多信息。如果在贡献过程中遇到任何问题,欢迎随时提交 PR 来改进文档,以帮助其他开发者。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -27,7 +27,9 @@ msgstr "多节点测试"
|
||||
msgid ""
|
||||
"Multi-Node CI is designed to test distributed scenarios of very large "
|
||||
"models, eg: disaggregated_prefill multi DP across multi nodes and so on."
|
||||
msgstr "多节点CI旨在测试超大规模模型的分布式场景,例如:跨多节点的解耦预填充(disaggregated_prefill)、多数据并行(multi DP)等。"
|
||||
msgstr ""
|
||||
"多节点CI旨在测试超大规模模型的分布式场景,例如:跨多节点的解耦预填充(disaggregated_prefill)、多数据并行(multi "
|
||||
"DP)等。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:5
|
||||
msgid "How it works"
|
||||
@@ -39,7 +41,10 @@ msgid ""
|
||||
"CI mechanism. It shows how the GitHub action interacts with "
|
||||
"[lws](https://lws.sigs.k8s.io/docs/overview/) (a kind of kubernetes crd "
|
||||
"resource)."
|
||||
msgstr "下图展示了多节点CI机制的基本部署视图。它说明了GitHub Action如何与[lws](https://lws.sigs.k8s.io/docs/overview/)(一种Kubernetes CRD资源)进行交互。"
|
||||
msgstr ""
|
||||
"下图展示了多节点CI机制的基本部署视图。它说明了GitHub "
|
||||
"Action如何与[lws](https://lws.sigs.k8s.io/docs/overview/)(一种Kubernetes "
|
||||
"CRD资源)进行交互。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:9
|
||||
msgid ""
|
||||
@@ -62,7 +67,12 @@ msgid ""
|
||||
"[LWS_WORKER_INDEX](https://lws.sigs.k8s.io/docs/reference/labels-"
|
||||
"annotations-and-environment-variables/) environment variable, so that "
|
||||
"multiple nodes can form a distributed cluster to perform tasks."
|
||||
msgstr "从工作流的角度,我们可以看到最终的测试脚本是如何执行的。关键在于这两个文件:[lws.yaml和run.sh](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/scripts)。前者定义了我们的k8s集群如何被拉起,后者定义了Pod启动时的入口脚本。每个节点根据[LWS_WORKER_INDEX](https://lws.sigs.k8s.io/docs/reference/labels-annotations-and-environment-variables/)环境变量执行不同的逻辑,从而使多个节点能够组成一个分布式集群来执行任务。"
|
||||
msgstr ""
|
||||
"从工作流的角度,我们可以看到最终的测试脚本是如何执行的。关键在于这两个文件:[lws.yaml和run.sh](https://github.com"
|
||||
"/vllm-project/vllm-"
|
||||
"ascend/tree/main/tests/e2e/nightly/multi_node/scripts)。前者定义了我们的k8s集群如何被拉起,后者定义了Pod启动时的入口脚本。每个节点根据[LWS_WORKER_INDEX](https://lws.sigs.k8s.io/docs/reference"
|
||||
"/labels-annotations-and-environment-"
|
||||
"variables/)环境变量执行不同的逻辑,从而使多个节点能够组成一个分布式集群来执行任务。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:13
|
||||
msgid ""
|
||||
@@ -83,7 +93,10 @@ msgid ""
|
||||
"to ModelScope's [vllm-ascend](https://www.modelscope.cn/organization"
|
||||
"/vllm-ascend) organization is welcome. If you do not have permission to "
|
||||
"upload, please contact @Potabk"
|
||||
msgstr "如果您需要自定义权重,例如,您为DeepSeek-V3量化了一个w8a8权重,并希望您的权重能在CI上运行,欢迎将权重上传至ModelScope的[vllm-ascend](https://www.modelscope.cn/organization/vllm-ascend)组织。如果您没有上传权限,请联系@Potabk。"
|
||||
msgstr ""
|
||||
"如果您需要自定义权重,例如,您为DeepSeek-V3量化了一个w8a8权重,并希望您的权重能在CI上运行,欢迎将权重上传至ModelScope的"
|
||||
"[vllm-ascend](https://www.modelscope.cn/organization/vllm-"
|
||||
"ascend)组织。如果您没有上传权限,请联系@Potabk。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:21
|
||||
msgid "Add config yaml"
|
||||
@@ -100,7 +113,13 @@ msgid ""
|
||||
" just add \"yamls\" like [DeepSeek-V3.yaml](https://github.com/vllm-"
|
||||
"project/vllm-"
|
||||
"ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml)."
|
||||
msgstr "如入口脚本[run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106)所示,一个k8s Pod的启动意味着遍历[目录](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/)中的所有*.yaml文件,并根据不同的配置读取和执行。因此,我们需要做的就是添加类似[DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml)的\"yaml\"文件。"
|
||||
msgstr ""
|
||||
"如入口脚本[run.sh](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106)所示,一个k8s"
|
||||
" Pod的启动意味着遍历[目录](https://github.com/vllm-project/vllm-"
|
||||
"ascend/tree/main/tests/e2e/nightly/multi_node/config/)中的所有*.yaml文件,并根据不同的配置读取和执行。因此,我们需要做的就是添加类似[DeepSeek-V3.yaml](https://github.com"
|
||||
"/vllm-project/vllm-"
|
||||
"ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml)的\"yaml\"文件。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:25
|
||||
msgid ""
|
||||
@@ -121,7 +140,9 @@ msgid ""
|
||||
"Currently, the multi-node test workflow is defined in the "
|
||||
"[nightly_test_a3.yaml](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)"
|
||||
msgstr "目前,多节点测试工作流定义在[nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)中。"
|
||||
msgstr ""
|
||||
"目前,多节点测试工作流定义在[nightly_test_a3.yaml](https://github.com/vllm-project"
|
||||
"/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)中。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:110
|
||||
msgid ""
|
||||
@@ -146,7 +167,9 @@ msgid ""
|
||||
"This section assumes that you already have a "
|
||||
"[Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment "
|
||||
"locally. Then you can easily start our test with one click."
|
||||
msgstr "本节假设您本地已经有一个[Kubernetes](https://kubernetes.io/docs/setup/) NPU集群环境。然后您可以轻松地一键启动我们的测试。"
|
||||
msgstr ""
|
||||
"本节假设您本地已经有一个[Kubernetes](https://kubernetes.io/docs/setup/) "
|
||||
"NPU集群环境。然后您可以轻松地一键启动我们的测试。"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:118
|
||||
msgid "Step 1. Install LWS CRD resources"
|
||||
@@ -159,7 +182,7 @@ msgid ""
|
||||
msgstr "参考<https://lws.sigs.k8s.io/docs/installation/>"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:122
|
||||
msgid "Step 2. Deploy the following yaml file `lws.yaml` as what you want"
|
||||
msgid "Step 2. Deploy the following yaml file `lws.yaml` as needed"
|
||||
msgstr "步骤 2. 按需部署以下yaml文件`lws.yaml`"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:258
|
||||
@@ -199,7 +222,10 @@ msgid ""
|
||||
"ascend/blob/e760aae1df7814073a4180172385505c1ec0fd83/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml#L25)"
|
||||
" after the configure item `num_nodes` , for example: `cluster_hosts: "
|
||||
"[\"xxx.xxx.xxx.188\", \"xxx.xxx.xxx.212\"]`"
|
||||
msgstr "在每个集群主机上进行修改,就像[DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/e760aae1df7814073a4180172385505c1ec0fd83/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml#L25)那样,在配置项`num_nodes`之后添加,例如:`cluster_hosts: [\"xxx.xxx.xxx.188\", \"xxx.xxx.xxx.212\"]`"
|
||||
msgstr ""
|
||||
"在每个集群主机上进行修改,就像[DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/e760aae1df7814073a4180172385505c1ec0fd83/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml#L25)那样,在配置项`num_nodes`之后添加,例如:`cluster_hosts:"
|
||||
" [\"xxx.xxx.xxx.188\", \"xxx.xxx.xxx.212\"]`"
|
||||
|
||||
#: ../../source/developer_guide/contribution/multi_node_test.md:321
|
||||
msgid "Step 2. Install develop environment"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -28,7 +28,9 @@ msgid ""
|
||||
"This document guides you to conduct accuracy testing using "
|
||||
"[AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench "
|
||||
"provides accuracy and performance evaluation for many datasets."
|
||||
msgstr "本文档指导您如何使用 [AISBench](https://gitee.com/aisbench/benchmark/tree/master) 进行精度测试。AISBench 为许多数据集提供了精度和性能评估。"
|
||||
msgstr ""
|
||||
"本文档指导您如何使用 [AISBench](https://gitee.com/aisbench/benchmark/tree/master) "
|
||||
"进行精度测试。AISBench 为许多数据集提供了精度和性能评估。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:5
|
||||
msgid "Online Server"
|
||||
@@ -68,7 +70,9 @@ msgstr "安装 AISBench"
|
||||
msgid ""
|
||||
"Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for"
|
||||
" details. Install AISBench from source."
|
||||
msgstr "详情请参考 [AISBench](https://gitee.com/aisbench/benchmark/tree/master)。从源码安装 AISBench。"
|
||||
msgstr ""
|
||||
"详情请参考 [AISBench](https://gitee.com/aisbench/benchmark/tree/master)。从源码安装 "
|
||||
"AISBench。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:69
|
||||
msgid "Install extra AISBench dependencies."
|
||||
@@ -96,7 +100,10 @@ msgid ""
|
||||
"[Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets)"
|
||||
" for more datasets. Each dataset has a `README.md` with detailed download"
|
||||
" and installation instructions."
|
||||
msgstr "以 `C-Eval` 数据集为例。更多数据集请参考 [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets)。每个数据集都有一个 `README.md` 文件,包含详细的下载和安装说明。"
|
||||
msgstr ""
|
||||
"以 `C-Eval` 数据集为例。更多数据集请参考 "
|
||||
"[Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets)。每个数据集都有一个"
|
||||
" `README.md` 文件,包含详细的下载和安装说明。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:86
|
||||
msgid "Download dataset and install it to specific path."
|
||||
@@ -136,7 +143,9 @@ msgid ""
|
||||
"`benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`."
|
||||
" There are several arguments that you should update according to your "
|
||||
"environment."
|
||||
msgstr "更新文件 `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`。有几个参数需要根据您的环境进行更新。"
|
||||
msgstr ""
|
||||
"更新文件 "
|
||||
"`benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`。有几个参数需要根据您的环境进行更新。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:158
|
||||
msgid ""
|
||||
@@ -169,9 +178,11 @@ msgstr "`host_ip` 和 `host_port`:更新为您的 vLLM 服务器的 IP 和端
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:164
|
||||
msgid ""
|
||||
"`max_out_len`: Note `max_out_len` + LLM input length should be less than "
|
||||
"`max-model-len`(config in your vllm server), `32768` will be suitable for"
|
||||
"`max_model_len`(config in your vllm server), `32768` will be suitable for"
|
||||
" most datasets."
|
||||
msgstr "`max_out_len`:注意 `max_out_len` + LLM 输入长度应小于 `max-model-len`(在您的 vllm 服务器中配置),`32768` 适用于大多数数据集。"
|
||||
msgstr ""
|
||||
"`max_out_len`:注意 `max_out_len` + LLM 输入长度应小于 `max_model_len`(在您的 vllm "
|
||||
"服务器中配置),`32768` 适用于大多数数据集。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_ais_bench.md:165
|
||||
msgid "`batch_size`: Update according to your dataset."
|
||||
@@ -235,4 +246,7 @@ msgid ""
|
||||
"You need to manually replace the dataset image paths with absolute paths,"
|
||||
" changing `/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` "
|
||||
"to the actual absolute directory where the images are stored:"
|
||||
msgstr "您需要手动将数据集图像路径替换为绝对路径,将 `/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` 更改为图像存储的实际绝对目录:"
|
||||
msgstr ""
|
||||
"您需要手动将数据集图像路径替换为绝对路径,将 "
|
||||
"`/path/to/benchmark/ais_bench/datasets/textvqa/train_images/` "
|
||||
"更改为图像存储的实际绝对目录:"
|
||||
@@ -1,14 +1,8 @@
|
||||
# SOME DESCRIPTIVE TITLE.
|
||||
# Copyright (C) 2025, vllm-ascend team
|
||||
# This file is distributed under the same license as the vllm-ascend
|
||||
# package.
|
||||
# FIRST AUTHOR <EMAIL@ADDRESS>, 2025.
|
||||
#
|
||||
msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -65,8 +59,10 @@ msgid "3. Run GSM8K using EvalScope for accuracy testing"
|
||||
msgstr "3. 使用 EvalScope 运行 GSM8K 进行精度测试"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_evalscope.md:68
|
||||
msgid "You can use `evalscope eval` to run GSM8K for accuracy testing:"
|
||||
msgstr "你可以使用 `evalscope eval` 运行 GSM8K 进行精度测试:"
|
||||
msgid ""
|
||||
"You can use `evalscope eval` to run GSM8K (a grade-school math benchmark "
|
||||
"dataset) for accuracy testing:"
|
||||
msgstr "你可以使用 `evalscope eval` 运行 GSM8K(一个小学数学基准数据集)进行精度测试:"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_evalscope.md:80
|
||||
#: ../../source/developer_guide/evaluation/using_evalscope.md:117
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -44,9 +44,11 @@ msgid "The vLLM server is started successfully, if you see logs as below:"
|
||||
msgstr "如果您看到如下日志,则表示 vLLM 服务器已成功启动:"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:46
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:175
|
||||
msgid "2. Run GSM8K using lm-eval for accuracy testing"
|
||||
msgstr "2. 使用 lm-eval 运行 GSM8K 进行准确率测试"
|
||||
msgid ""
|
||||
"2. Run GSM8K using the vLLM server (curl) and then run lm-eval for "
|
||||
"accuracy testing"
|
||||
msgstr ""
|
||||
"2. 使用 vLLM 服务器(curl)运行 GSM8K,然后运行 lm-eval 进行准确率测试"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:48
|
||||
msgid "You can query the result with input prompts:"
|
||||
@@ -68,7 +70,10 @@ msgid ""
|
||||
"may cause lm-eval to download datasets from ModelScope instead of "
|
||||
"HuggingFace. Setting `USE_MODELSCOPE_HUB=0` disables this behavior so "
|
||||
"that lm-eval can fetch datasets from HuggingFace correctly."
|
||||
msgstr "Docker 容器以 `VLLM_USE_MODELSCOPE=True` 启动,这可能导致 lm-eval 从 ModelScope 而非 HuggingFace 下载数据集。设置 `USE_MODELSCOPE_HUB=0` 可禁用此行为,使 lm-eval 能够正确从 HuggingFace 获取数据集。"
|
||||
msgstr ""
|
||||
"Docker 容器以 `VLLM_USE_MODELSCOPE=True` 启动,这可能导致 lm-eval 从 ModelScope 而非 "
|
||||
"HuggingFace 下载数据集。设置 `USE_MODELSCOPE_HUB=0` 可禁用此行为,使 lm-eval 能够正确从 "
|
||||
"HuggingFace 获取数据集。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:120
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:192
|
||||
@@ -91,6 +96,10 @@ msgstr "1. 运行 docker 容器"
|
||||
msgid "You can run docker container on a single NPU:"
|
||||
msgstr "您可以在单个 NPU 上运行 docker 容器:"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:175
|
||||
msgid "2. Run GSM8K using lm-eval for accuracy testing"
|
||||
msgstr "2. 使用 lm-eval 运行 GSM8K 进行准确率测试"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_lm_eval.md:203
|
||||
msgid "After 1 to 2 minutes, the output is shown below:"
|
||||
msgstr "1 到 2 分钟后,输出如下所示:"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -50,7 +50,9 @@ msgid ""
|
||||
msgstr "服务器启动后,你可以在新的终端中使用输入提示词来查询模型。"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_opencompass.md:56
|
||||
msgid "2. Run C-Eval using OpenCompass for accuracy testing"
|
||||
msgid ""
|
||||
"2. Run C-Eval (a Chinese language model evaluation benchmark) using "
|
||||
"OpenCompass for accuracy testing"
|
||||
msgstr "2. 使用 OpenCompass 运行 C-Eval(一个中文语言模型评估基准)进行准确率测试"
|
||||
|
||||
#: ../../source/developer_guide/evaluation/using_opencompass.md:58
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -57,14 +57,16 @@ msgid ""
|
||||
"series (Atlas-A3-cann-kernels) and Atlas 300I (Ascend-cann-kernels-310p) "
|
||||
"series are supported:"
|
||||
msgstr ""
|
||||
"目前,**仅**支持 Atlas A2 系列(Ascend-cann-kernels-910b)、Atlas A3 系列(Atlas-A3-cann-kernels)和 Atlas 300I(Ascend-cann-kernels-310p)系列:"
|
||||
"目前,**仅**支持 Atlas A2 系列(Ascend-cann-kernels-910b)、Atlas A3 系列(Atlas-A3"
|
||||
"-cann-kernels)和 Atlas 300I(Ascend-cann-kernels-310p)系列:"
|
||||
|
||||
#: ../../source/faqs.md:14
|
||||
msgid ""
|
||||
"Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 "
|
||||
"Box16, Atlas 300T A2)"
|
||||
msgstr ""
|
||||
"Atlas A2 训练系列(Atlas 800T A2、Atlas 900 A2 PoD、Atlas 200T A2 Box16、Atlas 300T A2)"
|
||||
"Atlas A2 训练系列(Atlas 800T A2、Atlas 900 A2 PoD、Atlas 200T A2 Box16、Atlas "
|
||||
"300T A2)"
|
||||
|
||||
#: ../../source/faqs.md:15
|
||||
msgid "Atlas 800I A2 Inference series (Atlas 800I A2)"
|
||||
@@ -74,8 +76,7 @@ msgstr "Atlas 800I A2 推理系列(Atlas 800I A2)"
|
||||
msgid ""
|
||||
"Atlas A3 Training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas "
|
||||
"9000 A3 SuperPoD)"
|
||||
msgstr ""
|
||||
"Atlas A3 训练系列(Atlas 800T A3、Atlas 900 A3 SuperPoD、Atlas 9000 A3 SuperPoD)"
|
||||
msgstr "Atlas A3 训练系列(Atlas 800T A3、Atlas 900 A3 SuperPoD、Atlas 9000 A3 SuperPoD)"
|
||||
|
||||
#: ../../source/faqs.md:17
|
||||
msgid "Atlas 800I A3 Inference series (Atlas 800I A3)"
|
||||
@@ -109,7 +110,8 @@ msgid ""
|
||||
"supported. Otherwise, we have to implement it by using custom ops. We "
|
||||
"also welcome you to join us to improve together."
|
||||
msgstr ""
|
||||
"从技术角度看,如果 torch-npu 支持某设备,则 vllm-ascend 也支持该设备。否则,我们需要通过自定义算子来实现。我们也欢迎您加入我们,共同改进。"
|
||||
"从技术角度看,如果 torch-npu 支持某设备,则 vllm-ascend "
|
||||
"也支持该设备。否则,我们需要通过自定义算子来实现。我们也欢迎您加入我们,共同改进。"
|
||||
|
||||
#: ../../source/faqs.md:28
|
||||
msgid "2. How to get our docker containers?"
|
||||
@@ -158,8 +160,7 @@ msgstr "3. vllm-ascend 支持哪些模型?"
|
||||
msgid ""
|
||||
"Find more details "
|
||||
"[<u>here</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html)."
|
||||
msgstr ""
|
||||
"更多详细信息请参见[<u>此处</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html)。"
|
||||
msgstr "更多详细信息请参见[<u>此处</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html)。"
|
||||
|
||||
#: ../../source/faqs.md:74
|
||||
msgid "4. How to get in touch with our community?"
|
||||
@@ -190,13 +191,17 @@ msgstr "参加我们的[<u>每周例会</u>](https://docs.google.com/document/d/
|
||||
msgid ""
|
||||
"Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-"
|
||||
"ascend/issues/227) group and ask your questions."
|
||||
msgstr "加入我们的[<u>微信群</u>](https://github.com/vllm-project/vllm-ascend/issues/227)并提出您的问题。"
|
||||
msgstr ""
|
||||
"加入我们的[<u>微信群</u>](https://github.com/vllm-project/vllm-"
|
||||
"ascend/issues/227)并提出您的问题。"
|
||||
|
||||
#: ../../source/faqs.md:81
|
||||
msgid ""
|
||||
"Join our ascend channel in [<u>vLLM forums</u>](https://discuss.vllm.ai/c"
|
||||
"/hardware-support/vllm-ascend-support/6) and publish your topics."
|
||||
msgstr "加入我们在 [<u>vLLM 论坛</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) 的 ascend 频道并发布您的主题。"
|
||||
msgstr ""
|
||||
"加入我们在 [<u>vLLM 论坛</u>](https://discuss.vllm.ai/c/hardware-support/vllm-"
|
||||
"ascend-support/6) 的 ascend 频道并发布您的主题。"
|
||||
|
||||
#: ../../source/faqs.md:83
|
||||
msgid "5. What features does vllm-ascend V1 supports?"
|
||||
@@ -206,8 +211,7 @@ msgstr "5. vllm-ascend V1 支持哪些功能?"
|
||||
msgid ""
|
||||
"Find more details "
|
||||
"[<u>here</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)."
|
||||
msgstr ""
|
||||
"更多详细信息请参见[<u>此处</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)。"
|
||||
msgstr "更多详细信息请参见[<u>此处</u>](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)。"
|
||||
|
||||
#: ../../source/faqs.md:87
|
||||
msgid ""
|
||||
@@ -256,7 +260,9 @@ msgid ""
|
||||
"0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we "
|
||||
"ensure that `vllm-ascend` and `vllm` are compatible at every commit."
|
||||
msgstr ""
|
||||
"`vllm-ascend` 是 vLLM 的一个硬件插件。`vllm-ascend` 的版本与 `vllm` 的版本相同。例如,如果您使用 `vllm` 0.9.1,您也应该使用 vllm-ascend 0.9.1。对于主分支,我们确保 `vllm-ascend` 和 `vllm` 在每次提交时都是兼容的。"
|
||||
"`vllm-ascend` 是 vLLM 的一个硬件插件。`vllm-ascend` 的版本与 `vllm` 的版本相同。例如,如果您使用 "
|
||||
"`vllm` 0.9.1,您也应该使用 vllm-ascend 0.9.1。对于主分支,我们确保 `vllm-ascend` 和 `vllm` "
|
||||
"在每次提交时都是兼容的。"
|
||||
|
||||
#: ../../source/faqs.md:109
|
||||
msgid "8. Does vllm-ascend support Prefill Disaggregation feature?"
|
||||
@@ -269,7 +275,8 @@ msgid ""
|
||||
"tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)"
|
||||
" for example."
|
||||
msgstr ""
|
||||
"是的,vllm-ascend 支持通过 Mooncake 后端实现 Prefill Disaggregation 功能。示例请参见[官方教程](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)。"
|
||||
"是的,vllm-ascend 支持通过 Mooncake 后端实现 Prefill Disaggregation "
|
||||
"功能。示例请参见[官方教程](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)。"
|
||||
|
||||
#: ../../source/faqs.md:113
|
||||
msgid "9. Does vllm-ascend support quantization method?"
|
||||
@@ -299,7 +306,8 @@ msgid ""
|
||||
"features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)"
|
||||
" through E2E test."
|
||||
msgstr ""
|
||||
"**功能测试**:我们添加了 CI,包括部分 vllm 的原生单元测试和 vllm-ascend 自身的单元测试。在 vllm-ascend 的测试中,我们通过端到端测试来验证基本功能、主流模型的可用性以及[支持的功能](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)。"
|
||||
"**功能测试**:我们添加了 CI,包括部分 vllm 的原生单元测试和 vllm-ascend 自身的单元测试。在 vllm-ascend "
|
||||
"的测试中,我们通过端到端测试来验证基本功能、主流模型的可用性以及[支持的功能](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html)。"
|
||||
|
||||
#: ../../source/faqs.md:123
|
||||
msgid ""
|
||||
@@ -308,13 +316,14 @@ msgid ""
|
||||
"benchmark, which can be easily re-run locally. We will publish a perf "
|
||||
"website to show the performance test results for each pull request."
|
||||
msgstr ""
|
||||
"**性能测试**:我们提供了用于端到端性能基准测试的[基准测试](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)工具,可以方便地在本地重新运行。我们将发布一个性能网站,展示每个拉取请求的性能测试结果。"
|
||||
"**性能测试**:我们提供了用于端到端性能基准测试的[基准测试](https://github.com/vllm-project/vllm-"
|
||||
"ascend/tree/main/benchmarks)工具,可以方便地在本地重新运行。我们将发布一个性能网站,展示每个拉取请求的性能测试结果。"
|
||||
|
||||
#: ../../source/faqs.md:125
|
||||
msgid ""
|
||||
"**Accuracy test**: We are working on adding accuracy test to the CI as "
|
||||
"well."
|
||||
msgstr "**准确性测试**:我们正在努力将准确性测试也添加到CI中。"
|
||||
msgstr "**准确性测试**:我们正在努力将准确性测试也添加到 CI 中。"
|
||||
|
||||
#: ../../source/faqs.md:127
|
||||
msgid ""
|
||||
@@ -341,8 +350,9 @@ msgid ""
|
||||
"to the version of the vLLM package you have installed. The format of "
|
||||
"`VLLM_VERSION` should be `X.Y.Z`."
|
||||
msgstr ""
|
||||
"此问题通常是由于安装了开发版或可编辑版本的 vLLM 包引起的。为此,我们提供了环境变量 `VLLM_VERSION`,允许用户指定要使用的 vLLM "
|
||||
"包版本。请将环境变量 `VLLM_VERSION` 设置为你已安装的 vLLM 包的版本。`VLLM_VERSION` 的格式应为 `X.Y.Z`。"
|
||||
"此问题通常是由于安装了开发版或可编辑版本的 vLLM 包引起的。为此,我们提供了环境变量 `VLLM_VERSION`,允许用户指定要使用的 "
|
||||
"vLLM 包版本。请将环境变量 `VLLM_VERSION` 设置为你已安装的 vLLM 包的版本。`VLLM_VERSION` 的格式应为 "
|
||||
"`X.Y.Z`。"
|
||||
|
||||
#: ../../source/faqs.md:135
|
||||
msgid "12. How to handle the out-of-memory issue?"
|
||||
@@ -356,20 +366,22 @@ msgid ""
|
||||
"documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-"
|
||||
"of-memory)."
|
||||
msgstr ""
|
||||
"当模型超出单个 NPU 的内存容量时,通常会发生 OOM(内存不足)错误。一般性指导可参考 [vLLM OOM 故障排除文档](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory)。"
|
||||
"当模型超出单个 NPU 的内存容量时,通常会发生 OOM(内存不足)错误。一般性指导可参考 [vLLM OOM "
|
||||
"故障排除文档](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-"
|
||||
"memory)。"
|
||||
|
||||
#: ../../source/faqs.md:139
|
||||
msgid ""
|
||||
"In scenarios where NPUs have limited high bandwidth memory (HBM) "
|
||||
"capacity, dynamic memory allocation/deallocation during inference can "
|
||||
"exacerbate memory fragmentation, leading to OOM. To address this:"
|
||||
msgstr "在 NPU 的高带宽内存容量有限的场景下,推理过程中的动态内存分配/释放会加剧内存碎片,导致 OOM。为解决此问题:"
|
||||
"In scenarios where NPUs have limited high bandwidth memory (on-chip "
|
||||
"memory) capacity, dynamic memory allocation/deallocation during inference"
|
||||
" can exacerbate memory fragmentation, leading to OOM. To address this:"
|
||||
msgstr "在 NPU 的高带宽内存(片上内存)容量有限的场景下,推理过程中的动态内存分配/释放会加剧内存碎片,导致 OOM。为解决此问题:"
|
||||
|
||||
#: ../../source/faqs.md:141
|
||||
msgid ""
|
||||
"**Limit `--max-model-len`**: It can save the HBM usage for KV cache "
|
||||
"initialization step."
|
||||
msgstr "**限制 `--max-model-len`**:它可以节省 KV 缓存初始化步骤的 HBM 使用量。"
|
||||
"**Limit `--max-model-len`**: It can save the on-chip memory usage for KV "
|
||||
"cache initialization step."
|
||||
msgstr "**限制 `--max-model-len`**:它可以节省 KV 缓存初始化步骤的片上内存使用量。"
|
||||
|
||||
#: ../../source/faqs.md:143
|
||||
msgid ""
|
||||
@@ -379,7 +391,9 @@ msgid ""
|
||||
"Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-"
|
||||
"utilization)."
|
||||
msgstr ""
|
||||
"**调整 `--gpu-memory-utilization`**:如果未指定,默认值为 `0.9`。你可以降低此值以预留更多内存,从而减少碎片风险。详情参见:[vLLM - 推理与服务 - 引擎参数](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization)。"
|
||||
"**调整 `--gpu-memory-utilization`**:如果未指定,默认值为 "
|
||||
"`0.9`。你可以降低此值以预留更多内存,从而减少碎片风险。详情参见:[vLLM - 推理与服务 - "
|
||||
"引擎参数](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization)。"
|
||||
|
||||
#: ../../source/faqs.md:145
|
||||
msgid ""
|
||||
@@ -401,22 +415,24 @@ msgstr "13. 运行 DeepSeek 时无法启用 NPU 图模式"
|
||||
#: ../../source/faqs.md:149
|
||||
msgid ""
|
||||
"Enabling NPU graph mode for DeepSeek may trigger an error. This is "
|
||||
"because when both MLA and NPU graph mode are active, the number of "
|
||||
"queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has"
|
||||
" only 16 attention heads, which results in 16 queries per KV—a value "
|
||||
"outside the supported range. Support for NPU graph mode on "
|
||||
"DeepSeek-V2-Lite will be added in a future update."
|
||||
"because when both MLA (Multi-Head Latent Attention) and NPU graph mode "
|
||||
"are active, the number of queries per KV head must be 32, 64, or 128. "
|
||||
"However, DeepSeek-V2-Lite has only 16 attention heads, which results in "
|
||||
"16 queries per KV—a value outside the supported range. Support for NPU "
|
||||
"graph mode on DeepSeek-V2-Lite will be added in a future update."
|
||||
msgstr ""
|
||||
"为 DeepSeek 启用 NPU 图模式可能会触发错误。这是因为当 MLA 和 NPU 图模式同时激活时,每个 KV 头的查询数必须为 32、64 或 "
|
||||
"128。然而,DeepSeek-V2-Lite 只有 16 个注意力头,导致每个 KV 有 16 个查询,该值超出了支持范围。对 "
|
||||
"为 DeepSeek 启用 NPU 图模式可能会触发错误。这是因为当 MLA(多头潜在注意力)和 NPU 图模式同时激活时,每个 KV 头的查询数必须为 "
|
||||
"32、64 或 128。然而,DeepSeek-V2-Lite 只有 16 个注意力头,导致每个 KV 有 16 个查询,该值超出了支持范围。对 "
|
||||
"DeepSeek-V2-Lite 的 NPU 图模式支持将在未来的更新中添加。"
|
||||
|
||||
#: ../../source/faqs.md:151
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after "
|
||||
"the tensor parallel split, `num_heads`/`num_kv_heads` is {32, 64, 128}."
|
||||
msgstr ""
|
||||
"如果你正在使用 DeepSeek-V3 或 DeepSeek-R1,请确保在张量并行切分后,`num_heads`/`num_kv_heads` 的值为 {32, 64, 128} 中的一个。"
|
||||
"如果你正在使用 DeepSeek-V3 或 DeepSeek-R1,请确保在张量并行切分后,`num_heads`/`num_kv_heads` "
|
||||
"的值为 {32, 64, 128} 中的一个。"
|
||||
|
||||
#: ../../source/faqs.md:158
|
||||
msgid ""
|
||||
@@ -431,7 +447,8 @@ msgid ""
|
||||
"fails, use `python setup.py install` (recommended) to install, or use "
|
||||
"`python setup.py clean` to clear the cache."
|
||||
msgstr ""
|
||||
"使用 pip 从源码重新安装 vllm-ascend 时,可能会遇到 C/C++ 编译失败的问题。如果安装失败,请使用 `python setup.py install`(推荐)进行安装,或使用 `python setup.py clean` 清除缓存。"
|
||||
"使用 pip 从源码重新安装 vllm-ascend 时,可能会遇到 C/C++ 编译失败的问题。如果安装失败,请使用 `python "
|
||||
"setup.py install`(推荐)进行安装,或使用 `python setup.py clean` 清除缓存。"
|
||||
|
||||
#: ../../source/faqs.md:162
|
||||
msgid "15. How to generate deterministic results when using vllm-ascend?"
|
||||
@@ -445,8 +462,7 @@ msgstr "有几个因素会影响输出的确定性:"
|
||||
msgid ""
|
||||
"Sampler method: using **greedy sampling** by setting `temperature=0` in "
|
||||
"`SamplingParams`, e.g.:"
|
||||
msgstr ""
|
||||
"采样方法:通过在 `SamplingParams` 中设置 `temperature=0` 来使用 **贪婪采样**,例如:"
|
||||
msgstr "采样方法:通过在 `SamplingParams` 中设置 `temperature=0` 来使用 **贪婪采样**,例如:"
|
||||
|
||||
#: ../../source/faqs.md:191
|
||||
msgid "Set the following environment parameters:"
|
||||
@@ -456,7 +472,9 @@ msgstr "设置以下环境参数:"
|
||||
msgid ""
|
||||
"16. How to fix the error \"ImportError: Please install vllm[audio] for "
|
||||
"audio support\" for the Qwen2.5-Omni model?"
|
||||
msgstr "16. 对于 Qwen2.5-Omni 模型,如何修复 \"ImportError: Please install vllm[audio] for audio support\" 错误?"
|
||||
msgstr ""
|
||||
"16. 对于 Qwen2.5-Omni 模型,如何修复 \"ImportError: Please install vllm[audio] for"
|
||||
" audio support\" 错误?"
|
||||
|
||||
#: ../../source/faqs.md:202
|
||||
msgid ""
|
||||
@@ -467,7 +485,9 @@ msgid ""
|
||||
"`ImportError: No module named 'librosa'` issue and ensuring that the "
|
||||
"audio processing functionality works correctly."
|
||||
msgstr ""
|
||||
"`Qwen2.5-Omni` 模型需要安装 `librosa` 包,你需要安装 `qwen-omni-utils` 包以确保满足所有依赖,运行 `pip install qwen-omni-utils`。此包将安装 `librosa` 及其相关依赖,解决 `ImportError: No module named 'librosa'` 问题,并确保音频处理功能正常工作。"
|
||||
"`Qwen2.5-Omni` 模型需要安装 `librosa` 包,你需要安装 `qwen-omni-utils` 包以确保满足所有依赖,运行 "
|
||||
"`pip install qwen-omni-utils`。此包将安装 `librosa` 及其相关依赖,解决 `ImportError: No "
|
||||
"module named 'librosa'` 问题,并确保音频处理功能正常工作。"
|
||||
|
||||
#: ../../source/faqs.md:205
|
||||
msgid ""
|
||||
@@ -480,10 +500,13 @@ msgid "Recommended mitigation strategies:"
|
||||
msgstr "推荐的缓解策略:"
|
||||
|
||||
#: ../../source/faqs.md:215
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"Manually configure the compilation_config parameter with a reduced size "
|
||||
"set: '{\"cudagraph_capture_sizes\":[size1, size2, size3, ...]}'."
|
||||
msgstr "手动配置 compilation_config 参数,使用缩减后的尺寸集合:'{\"cudagraph_capture_sizes\":[size1, size2, size3, ...]}'。"
|
||||
msgstr ""
|
||||
"手动配置 compilation_config "
|
||||
"参数,使用缩减后的尺寸集合:'{\"cudagraph_capture_sizes\":[size1, size2, size3, ...]}'。"
|
||||
|
||||
#: ../../source/faqs.md:216
|
||||
msgid ""
|
||||
@@ -502,7 +525,9 @@ msgid ""
|
||||
"additional streams outside of this calculation framework, resulting in "
|
||||
"stream resource exhaustion during size capture operations."
|
||||
msgstr ""
|
||||
"根本原因分析:当前尺寸捕获的流需求计算仅考虑了可测量的因素,包括:数据并行大小、张量并行大小、专家并行配置、分段图数量、多流重叠共享专家设置以及 HCCL 通信模式(AIV/AICPU)。然而,许多不可量化的因素,例如算子特性和特定硬件特性,在此计算框架之外消耗了额外的流,导致尺寸捕获操作期间流资源耗尽。"
|
||||
"根本原因分析:当前尺寸捕获的流需求计算仅考虑了可测量的因素,包括:数据并行大小、张量并行大小、专家并行配置、分段图数量、多流重叠共享专家设置以及 "
|
||||
"HCCL "
|
||||
"通信模式(AIV/AICPU)。然而,许多不可量化的元素,例如算子特性和特定硬件特性,在此计算框架之外消耗了额外的流,导致尺寸捕获操作期间流资源耗尽。"
|
||||
|
||||
#: ../../source/faqs.md:221
|
||||
msgid "18. How to install custom version of torch_npu?"
|
||||
@@ -513,7 +538,9 @@ msgid ""
|
||||
"torch-npu will be overridden when installing vllm-ascend. If you need to"
|
||||
" install a specific version of torch-npu, you can manually install the "
|
||||
"specified version of torch-npu after vllm-ascend is installed."
|
||||
msgstr "安装 vllm-ascend 时会覆盖 torch-npu。如果你需要安装特定版本的 torch-npu,可以在 vllm-ascend 安装后手动安装指定版本的 torch-npu。"
|
||||
msgstr ""
|
||||
"安装 vllm-ascend 时会覆盖 torch-npu。如果你需要安装特定版本的 torch-npu,可以在 vllm-ascend "
|
||||
"安装后手动安装指定版本的 torch-npu。"
|
||||
|
||||
#: ../../source/faqs.md:225
|
||||
msgid ""
|
||||
@@ -565,7 +592,9 @@ msgid ""
|
||||
"security risk. Only use this option if you understand the implications "
|
||||
"and trust the container's source."
|
||||
msgstr ""
|
||||
"使用 `--shm-size` 时,你可能需要在 `docker run` 命令中添加 `--privileged=true` 标志,以授予容器必要的权限。请注意,使用 `--privileged=true` 会授予容器在主机系统上的广泛权限,这可能带来安全风险。只有在理解其影响并信任容器来源的情况下才使用此选项。"
|
||||
"使用 `--shm-size` 时,你可能需要在 `docker run` 命令中添加 `--privileged=true` "
|
||||
"标志,以授予容器必要的权限。请注意,使用 `--privileged=true` "
|
||||
"会授予容器在主机系统上的广泛权限,这可能带来安全风险。只有在理解其影响并信任容器来源的情况下才使用此选项。"
|
||||
|
||||
#: ../../source/faqs.md:256
|
||||
msgid "21. How to achieve low latency in a small batch scenario?"
|
||||
@@ -580,7 +609,10 @@ msgid ""
|
||||
"`tools/install_flash_infer_attention_score_ops_a3.sh`, you can install it"
|
||||
" using the following instruction:"
|
||||
msgstr ""
|
||||
"`torch_npu.npu_fused_infer_attention_score` 在小批量场景下的性能不理想,主要是由于缺乏 Flash Decoding 功能。我们在 `tools/install_flash_infer_attention_score_ops_a2.sh` 和 `tools/install_flash_infer_attention_score_ops_a3.sh` 中提供了一个替代算子,你可以使用以下指令安装它:"
|
||||
"`torch_npu.npu_fused_infer_attention_score` 在小批量场景下的性能不理想,主要是由于缺乏 Flash "
|
||||
"Decoding 功能。我们在 `tools/install_flash_infer_attention_score_ops_a2.sh` 和 "
|
||||
"`tools/install_flash_infer_attention_score_ops_a3.sh` "
|
||||
"中提供了一个替代算子,你可以使用以下指令安装它:"
|
||||
|
||||
#: ../../source/faqs.md:266
|
||||
msgid ""
|
||||
@@ -593,7 +625,12 @@ msgid ""
|
||||
"create one. If you're not the root user, you need `sudo` **privileges** "
|
||||
"to run this script."
|
||||
msgstr ""
|
||||
"**注意**:使用此方法时不要设置 `additional_config.pa_shape_list`;否则会导致使用另一个注意力算子。**重要**:请确保你使用的是 `vllm-ascend` 的**官方镜像**;否则,你**必须将** `tools/install_flash_infer_attention_score_ops_a2.sh` 或 `tools/install_flash_infer_attention_score_ops_a3.sh` 中的目录 `/vllm-workspace` **更改为你自己的目录**,或者创建一个。如果你不是 root 用户,则需要 `sudo` **权限**来运行此脚本。"
|
||||
"**注意**:使用此方法时不要设置 "
|
||||
"`additional_config.pa_shape_list`;否则会导致使用另一个注意力算子。**重要**:请确保你使用的是 `vllm-"
|
||||
"ascend` 的**官方镜像**;否则,你**必须将** "
|
||||
"`tools/install_flash_infer_attention_score_ops_a2.sh` 或 "
|
||||
"`tools/install_flash_infer_attention_score_ops_a3.sh` 中的目录 `/vllm-"
|
||||
"workspace` **更改为你自己的目录**,或者创建一个。如果你不是 root 用户,则需要 `sudo` **权限**来运行此脚本。"
|
||||
|
||||
#: ../../source/faqs.md:269
|
||||
msgid ""
|
||||
@@ -608,7 +645,8 @@ msgid ""
|
||||
"(common in CPU-only build environments), you must set `SOC_VERSION` "
|
||||
"manually before installation."
|
||||
msgstr ""
|
||||
"从源码构建时(例如 `pip install -e .`),构建过程可能会尝试通过 `npu-smi` 推断目标芯片。如果 `npu-smi` 不可用(在仅含 CPU 的构建环境中很常见),则必须在安装前手动设置 `SOC_VERSION`。"
|
||||
"从源码构建时(例如 `pip install -e .`),构建过程可能会尝试通过 `npu-smi` 推断目标芯片。如果 `npu-smi` "
|
||||
"不可用(在仅含 CPU 的构建环境中很常见),则必须在安装前手动设置 `SOC_VERSION`。"
|
||||
|
||||
#: ../../source/faqs.md:273
|
||||
msgid "You can use the defaults from `Dockerfile*` as a reference. For example:"
|
||||
@@ -626,7 +664,9 @@ msgid ""
|
||||
"issue, please use the official docker images or install the specific "
|
||||
"triton-ascend version as following:"
|
||||
msgstr ""
|
||||
"如 [#7782](https://github.com/vllm-project/vllm-ascend/issues/7782) 所示,triton-ascend 偶尔会遇到编译错误,这是 triton-ascend 3.2.0 中的一个已知问题。为避免此问题,请使用官方 docker 镜像或按以下方式安装特定的 triton-ascend 版本:"
|
||||
"如 [#7782](https://github.com/vllm-project/vllm-ascend/issues/7782) 所示"
|
||||
",triton-ascend 偶尔会遇到编译错误,这是 triton-ascend 3.2.0 中的一个已知问题。为避免此问题,请使用官方 "
|
||||
"docker 镜像或按以下方式安装特定的 triton-ascend 版本:"
|
||||
|
||||
#: ../../source/faqs.md:300
|
||||
msgid "24. Why TPOT increases drastically as concurrency grows?"
|
||||
@@ -647,14 +687,20 @@ msgid ""
|
||||
"the future, which is why the performance might drop significantly. There "
|
||||
"are several ways to verify this:"
|
||||
msgstr ""
|
||||
"在测试 vLLM 服务器时,可能会发现 TPOT 随着并发度的增加而增加(例如,并发度增加 4 时,TPOT 增加 0.5 ~ 1ms)。在大多数情况下,这种现象是正常的。然而,有时随着并发度的增长,TPOT 可能会急剧增加(例如增加 10 到 100ms)。这可能是由 vLLM 中的 [**抢占**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption) 引起的。通常,当服务器达到 KV 缓存限制时,vLLM 会尝试释放请求的 KV 缓存,以确保为其他请求提供足够的空间,这在 vLLM 中称为抢占。当一个请求被抢占时,默认行为是在未来重新计算该请求的 KV 缓存,这就是性能可能显著下降的原因。有几种方法可以验证这一点:"
|
||||
"在测试 vLLM 服务器时,可能会发现 TPOT 随着并发度的增加而增加(例如,并发度增加 4 时,TPOT 增加 0.5 ~ "
|
||||
"1ms)。在大多数情况下,这种现象是正常的。然而,有时随着并发度的增长,TPOT 可能会急剧增加(例如增加 10 到 100ms)。这可能是由 "
|
||||
"vLLM 中的 "
|
||||
"[**抢占**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption)"
|
||||
" 引起的。通常,当服务器达到 KV 缓存限制时,vLLM 会尝试释放请求的 KV 缓存,以确保为其他请求提供足够的空间,这在 vLLM "
|
||||
"中称为抢占。当一个请求被抢占时,默认行为是在未来重新计算该请求的 KV 缓存,这就是性能可能显著下降的原因。有几种方法可以验证这一点:"
|
||||
|
||||
#: ../../source/faqs.md:305
|
||||
msgid ""
|
||||
"vLLM usually logs stats on your server. You might see metrics like `GPU "
|
||||
"KV cache usage: 99.0%,`. When reaching 100%, it triggers preemption."
|
||||
msgstr ""
|
||||
"vLLM 通常会在服务器上记录统计信息。您可能会看到类似 `GPU KV cache usage: 99.0%,` 的指标。当达到 100% 时,会触发抢占。"
|
||||
"vLLM 通常会在服务器上记录统计信息。您可能会看到类似 `GPU KV cache usage: 99.0%,` 的指标。当达到 100% "
|
||||
"时,会触发抢占。"
|
||||
|
||||
#: ../../source/faqs.md:306
|
||||
msgid ""
|
||||
@@ -663,7 +709,9 @@ msgid ""
|
||||
"4.05`. These are estimated KV cache capacity for a single DP group. You "
|
||||
"can adjust the overall request traffic according to this."
|
||||
msgstr ""
|
||||
"启动 vLLM 服务器时,您会看到类似 `GPU KV cache size: 66340 tokens` 和 `Maximum concurrency for 16,384 tokens per request: 4.05` 的日志。这些是针对单个 DP 组的估计 KV 缓存容量。您可以据此调整总体请求流量。"
|
||||
"启动 vLLM 服务器时,您会看到类似 `GPU KV cache size: 66340 tokens` 和 `Maximum "
|
||||
"concurrency for 16,384 tokens per request: 4.05` 的日志。这些是针对单个 DP 组的估计 KV "
|
||||
"缓存容量。您可以据此调整总体请求流量。"
|
||||
|
||||
#: ../../source/faqs.md:308
|
||||
msgid ""
|
||||
@@ -675,7 +723,10 @@ msgid ""
|
||||
"can increase `--gpu-memory-utilization` or decrease `--max-num-seqs` && "
|
||||
"`--max-num-batched-tokens`."
|
||||
msgstr ""
|
||||
"抢占无法完全避免,因为 KV 缓存的使用总是有限制的。但有方法可以减少抢占的发生几率。正如 [**抢占**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption) 中所建议的,核心策略是增加可用的 KV 缓存。例如,可以增加 `--gpu-memory-utilization` 或减少 `--max-num-seqs` 和 `--max-num-batched-tokens`。"
|
||||
"抢占无法完全避免,因为 KV 缓存的使用总是有限制的。但有方法可以减少抢占的发生几率。正如 "
|
||||
"[**抢占**](https://docs.vllm.ai/en/latest/configuration/optimization/#preemption)"
|
||||
" 中所建议的,核心策略是增加可用的 KV 缓存。例如,可以增加 `--gpu-memory-utilization` 或减少 `--max-"
|
||||
"num-seqs` 和 `--max-num-batched-tokens`。"
|
||||
|
||||
#~ msgid ""
|
||||
#~ "[[v0.7.3.post1] FAQ & Feedback](https://github.com"
|
||||
@@ -701,7 +752,8 @@ msgstr ""
 #~ "目前,只有部分模型得到了改进,例如 `Qwen2.5 VL`、`Qwen3` 和 "
 #~ "`Deepseek V3`。其他模型的效果还不够理想。从 0.9.0rc2 版本开始,Qwen "
 #~ "和 Deepseek 已支持图模式,以获得更好的性能。此外,您还可以在 `vllm-"
-#~ "ascend v0.7.3` 上安装 `mindie-turbo` 来进一步加速推理。"
+#~ "ascend v0.7.3` 上安装 `mindie-turbo` "
+#~ "来进一步加速推理。"

 #~ msgid ""
 #~ "Currently, only 1P1D is supported on "
@@ -721,7 +773,11 @@ msgstr ""
|
||||
#~ " use `pip install vllm-ascend[mindie-"
|
||||
#~ "turbo]`."
|
||||
#~ msgstr ""
|
||||
#~ "目前,w8a8 量化已在 v0.8.4rc2 或更高版本的 vllm-ascend 中原生支持。如果您使用的是 vllm 0.7.3 版本,通过集成 vllm-ascend 和 mindie-turbo 也支持 w8a8 量化,请使用 `pip install vllm-ascend[mindie-turbo]`。"
|
||||
#~ "目前,w8a8 量化已在 v0.8.4rc2 或更高版本的 vllm-"
|
||||
#~ "ascend 中原生支持。如果您使用的是 vllm 0.7.3 版本,通过集成 "
|
||||
#~ "vllm-ascend 和 mindie-turbo 也支持 w8a8"
|
||||
#~ " 量化,请使用 `pip install vllm-ascend[mindie-"
|
||||
#~ "turbo]`。"
|
||||
|
||||
#~ msgid "11. How to run w8a8 DeepSeek model?"
|
||||
#~ msgstr "11. 如何运行 w8a8 DeepSeek 模型?"
|
||||
@@ -733,7 +789,8 @@ msgstr ""
 #~ " replace model to DeepSeek."
 #~ msgstr ""
 #~ "请按照[推理教程](https://vllm-"
-#~ "ascend.readthedocs.io/en/latest/tutorials/multi_node.html)进行操作,并将模型替换为 DeepSeek。"
+#~ "ascend.readthedocs.io/en/latest/tutorials/multi_node.html)进行操作,并将模型替换为"
+#~ " DeepSeek。"

 #~ msgid ""
 #~ "12. There is no output in log "
@@ -750,7 +807,10 @@ msgstr ""
|
||||
#~ "pick it locally by yourself. Otherwise,"
|
||||
#~ " please fill up an issue."
|
||||
#~ msgstr ""
|
||||
#~ "如果您使用的是 vllm 0.7.3 版本,这是 VLLM 中一个已知的进度条显示问题,已在 [此 PR](https://github.com/vllm-project/vllm/pull/12428) 中解决,请自行在本地进行 cherry-pick。否则,请提交一个 issue。"
|
||||
#~ "如果您使用的是 vllm 0.7.3 版本,这是 VLLM "
|
||||
#~ "中一个已知的进度条显示问题,已在 [此 PR](https://github.com/vllm-"
|
||||
#~ "project/vllm/pull/12428) 中解决,请自行在本地进行 cherry-"
|
||||
#~ "pick。否则,请提交一个 issue。"
|
||||
|
||||
#~ msgid ""
|
||||
#~ "You may encounter the following error"
|
||||
@@ -765,4 +825,7 @@ msgstr ""
|
||||
#~ "DeepSeek-V2-Lite will be done in the "
|
||||
#~ "future."
|
||||
#~ msgstr ""
|
||||
#~ "如果在启用 NPU 图模式的情况下运行 DeepSeek,您可能会遇到以下错误。当同时启用 MLA 和图模式时,每个 kv 允许的查询数仅支持 {32, 64, 128},**因此这不支持 DeepSeek-V2-Lite**,因为它只有 16 个注意力头。未来将增加对 DeepSeek-V2-Lite 的 NPU 图模式支持。"
|
||||
#~ "如果在启用 NPU 图模式的情况下运行 DeepSeek,您可能会遇到以下错误。当同时启用 "
|
||||
#~ "MLA 和图模式时,每个 kv 允许的查询数仅支持 {32, 64, "
|
||||
#~ "128},**因此这不支持 DeepSeek-V2-Lite**,因为它只有 16 "
|
||||
#~ "个注意力头。未来将增加对 DeepSeek-V2-Lite 的 NPU 图模式支持。"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: 2025-07-18 10:09+0800\n"
 "Last-Translator: \n"
 "Language: zh_CN\n"
@@ -67,7 +67,9 @@ msgstr "昇腾 HDK"
 msgid ""
 "Refer to the documentation [CANN "
 "8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html)"
-msgstr "请参考文档 [CANN 8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html)"
+msgstr ""
+"请参考文档 [CANN "
+"8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html)"

 #: ../../source/installation.md
 msgid "Required for CANN"
@@ -139,7 +141,9 @@ msgid ""
 "installed correctly, refer to [Ascend Environment Setup "
 "Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) "
 "for more details."
-msgstr "安装前,您需要确保固件/驱动和 CANN 已正确安装,更多详情请参考 [昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"
+msgstr ""
+"安装前,您需要确保固件/驱动和 CANN 已正确安装,更多详情请参考 "
+"[昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"

 #: ../../source/installation.md:29
 msgid "Configure hardware environment"
@@ -156,7 +160,9 @@ msgid ""
 "Refer to [Ascend Environment Setup "
 "Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) "
 "for more details."
-msgstr "更多详情请参考 [昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"
+msgstr ""
+"更多详情请参考 "
+"[昇腾环境搭建指南](https://ascend.github.io/docs/sources/ascend/quick_install.html)。"

 #: ../../source/installation.md:39
 msgid "Configure software environment"
@@ -205,8 +211,7 @@ msgid ""
 msgstr "如果您使用 `vllm-ascend` 预构建的 Docker 镜像,则无需额外步骤。"

 #: ../../source/installation.md:119
-msgid ""
-"Once this is done, you can start to set up `vllm` and `vllm-ascend`."
+msgid "Once this is done, you can start to set up `vllm` and `vllm-ascend`."
 msgstr "完成此步骤后,您就可以开始设置 `vllm` 和 `vllm-ascend`。"

 #: ../../source/installation.md:121
@@ -240,7 +245,9 @@ msgid ""
|
||||
"If you are building custom operators for Atlas A3, you should run `git "
|
||||
"submodule update --init --recursive` manually, or ensure your environment"
|
||||
" has internet access."
|
||||
msgstr "如果您正在为 Atlas A3 构建自定义算子,您应该手动运行 `git submodule update --init --recursive`,或确保您的环境可以访问互联网。"
|
||||
msgstr ""
|
||||
"如果您正在为 Atlas A3 构建自定义算子,您应该手动运行 `git submodule update --init "
|
||||
"--recursive`,或确保您的环境可以访问互联网。"
|
||||
|
||||
#: ../../source/installation.md:178
|
||||
msgid ""
|
||||
@@ -251,7 +258,11 @@ msgid ""
|
||||
"compiling, it is probably because an unexpected compiler is being used, "
|
||||
"you may export `CXX_COMPILER` and `C_COMPILER` in the environment to "
|
||||
"specify your g++ and gcc locations before compiling."
|
||||
msgstr "构建自定义算子需要 gcc/g++ 版本高于 8 且支持 C++17 或更高标准。如果您使用 `pip install -e .` 并遇到 torch-npu 版本冲突,请使用 `pip install --no-build-isolation -e .` 在系统环境中进行安装。如果在编译过程中遇到其他问题,可能是因为使用了非预期的编译器,您可以在编译前通过环境变量导出 `CXX_COMPILER` 和 `C_COMPILER` 来指定您的 g++ 和 gcc 路径。"
|
||||
msgstr ""
|
||||
"构建自定义算子需要 gcc/g++ 版本高于 8 且支持 C++17 或更高标准。如果您使用 `pip install -e .` 并遇到 "
|
||||
"torch-npu 版本冲突,请使用 `pip install --no-build-isolation "
|
||||
"-e .` 在系统环境中进行安装。如果在编译过程中遇到其他问题,可能是因为使用了非预期的编译器,您可以在编译前通过环境变量导出 `CXX_COMPILER` "
|
||||
"和 `C_COMPILER` 来指定您的 g++ 和 gcc 路径。"
|
||||
|
||||
#: ../../source/installation.md:181
|
||||
msgid ""
|
||||
@@ -259,7 +270,9 @@ msgid ""
|
||||
"unavailable, you need to set `SOC_VERSION` before `pip install -e .` so "
|
||||
"the build can target the correct chip. You can refer to `Dockerfile*` "
|
||||
"defaults, for example:"
|
||||
msgstr "如果您在仅 CPU 的环境中构建,且 `npu-smi` 不可用,则需要在 `pip install -e .` 之前设置 `SOC_VERSION`,以便构建过程能针对正确的芯片。您可以参考 `Dockerfile*` 的默认值,例如:"
|
||||
msgstr ""
|
||||
"如果您在仅 CPU 的环境中构建,且 `npu-smi` 不可用,则需要在 `pip install -e .` 之前设置 "
|
||||
"`SOC_VERSION`,以便构建过程能针对正确的芯片。您可以参考 `Dockerfile*` 的默认值,例如:"
|
||||
|
||||
#: ../../source/installation.md:183
|
||||
msgid "Atlas A2: `export SOC_VERSION=ascend910b1`"
|
||||
@@ -287,7 +300,10 @@ msgid ""
|
||||
"**prebuilt image** from the image repository [ascend/vllm-"
|
||||
"ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and run "
|
||||
"it with bash."
|
||||
msgstr "`vllm-ascend` 提供用于部署的 Docker 镜像。您可以直接从镜像仓库 [ascend/vllm-ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) 拉取 **预构建镜像** 并使用 bash 运行。"
|
||||
msgstr ""
|
||||
"`vllm-ascend` 提供用于部署的 Docker 镜像。您可以直接从镜像仓库 [ascend/vllm-"
|
||||
"ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) 拉取 "
|
||||
"**预构建镜像** 并使用 bash 运行。"
|
||||
|
||||
#: ../../source/installation.md:193
|
||||
msgid "Supported images as following."
|
||||
@@ -362,11 +378,11 @@ msgid ""
 "The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed"
 " in `/vllm-workspace` and installed in [development "
 "mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)"
-" (`pip install -e`) to help developer immediately take place changes "
-"without requiring a new installation."
+" (`pip install -e`) to help developers immediately make changes without "
+"requiring a new installation."
 msgstr ""
-"默认工作目录为 `/workspace`,vLLM 和 vLLM Ascend 代码位于 `/vllm-workspace`"
-" 目录下,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip"
+"默认工作目录为 `/workspace`,vLLM 和 vLLM Ascend 代码位于 `/vllm-workspace` "
+"目录下,并以[开发模式](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip"
 " install -e`)安装,以便开发者能够即时应用更改,而无需重新安装。"

 #: ../../source/installation.md:249
@@ -392,7 +408,9 @@ msgid ""
|
||||
" them in the cached files.`), run the following commands to use "
|
||||
"ModelScope as an alternative:"
|
||||
msgstr ""
|
||||
"如果遇到 Hugging Face 连接错误(例如:`We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`),请运行以下命令以使用 ModelScope 作为替代方案:"
|
||||
"如果遇到 Hugging Face 连接错误(例如:`We couldn't connect to "
|
||||
"'https://huggingface.co' to load the files, and couldn't find them in the"
|
||||
" cached files.`),请运行以下命令以使用 ModelScope 作为替代方案:"
|
||||
|
||||
#: ../../source/installation.md:292
|
||||
msgid "The output will be like:"
|
||||
|
||||
@@ -1,14 +1,7 @@
|
||||
# SOME DESCRIPTIVE TITLE.
|
||||
# Copyright (C) 2025, vllm-ascend team
|
||||
# This file is distributed under the same license as the vllm-ascend
|
||||
# package.
|
||||
# FIRST AUTHOR <EMAIL@ADDRESS>, 2026.
|
||||
#
|
||||
msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -187,8 +180,8 @@ msgstr "`--tensor-parallel-size` 16 是张量并行(TP)大小的常见设置

 #: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:305
 msgid ""
-"`--prefill-context-parallel-size` 2 are common settings for prefill "
-"context parallelism (PCP) sizes."
+"`--prefill-context-parallel-size` 2 is common setting for prefill context"
+" parallelism (PCP) sizes."
 msgstr "`--prefill-context-parallel-size` 2 是预填充上下文并行(PCP)大小的常见设置。"

 #: ../../source/tutorials/features/long_sequence_context_parallel_multi_node.md:306
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -40,7 +40,9 @@ msgid ""
|
||||
"demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two "
|
||||
"Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies "
|
||||
"4 NPU cards and uses PD-colocated deployment."
|
||||
msgstr "本指南以 Qwen2.5-72B-Instruct 模型为例,演示如何在两个 Atlas 800T A2 节点上使用 vllm-ascend v0.11.0(包含 vLLM v0.11.0)部署两个 vLLM 实例。每个实例占用 4 个 NPU 卡,并采用 PD 共置部署。"
|
||||
msgstr ""
|
||||
"本指南以 Qwen2.5-72B-Instruct 模型为例,演示如何在两个 Atlas 800T A2 节点上使用 vllm-ascend "
|
||||
"v0.11.0(包含 vLLM v0.11.0)部署两个 vLLM 实例。每个实例占用 4 个 NPU 卡,并采用 PD 共置部署。"
|
||||
|
||||
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:14
|
||||
msgid "Verify Multi-Node Communication Environment"
|
||||
@@ -128,7 +130,10 @@ msgid ""
|
||||
"Mooncake is the serving platform for Kimi, a leading LLM service provided"
|
||||
" by Moonshot AI. Installation and compilation guide: <https://github.com"
|
||||
"/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>."
|
||||
msgstr "Mooncake 是 Kimi 的服务平台,Kimi 是由 Moonshot AI 提供的领先 LLM 服务。安装和编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>。"
|
||||
msgstr ""
|
||||
"Mooncake 是 Kimi 的服务平台,Kimi 是由 Moonshot AI 提供的领先 LLM "
|
||||
"服务。安装和编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file"
|
||||
"#build-and-use-binaries>。"
|
||||
|
||||
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:121
|
||||
msgid "First, obtain the Mooncake project using the following command:"
|
||||
@@ -275,7 +280,10 @@ msgid ""
|
||||
" cross-node, cross-instance KV Cache. Instance 1 utilizes NPU cards [0-3]"
|
||||
" on the first Atlas 800T A2 server, while Instance 2 utilizes cards [0-3]"
|
||||
" on the second server."
|
||||
msgstr "在节点 1 和节点 2 上分别创建容器,并在每个容器中启动 Qwen2.5-72B-Instruct 模型服务,以测试跨节点、跨实例 KV Cache 的可重用性和性能。实例 1 使用第一个 Atlas 800T A2 服务器上的 NPU 卡 [0-3],而实例 2 使用第二个服务器上的卡 [0-3]。"
|
||||
msgstr ""
|
||||
"在节点 1 和节点 2 上分别创建容器,并在每个容器中启动 Qwen2.5-72B-Instruct 模型服务,以测试跨节点、跨实例 KV "
|
||||
"Cache 的可重用性和性能。实例 1 使用第一个 Atlas 800T A2 服务器上的 NPU 卡 [0-3],而实例 2 "
|
||||
"使用第二个服务器上的卡 [0-3]。"
|
||||
|
||||
#: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:208
|
||||
msgid "Deploy Instance 1"
|
||||
@@ -430,9 +438,9 @@ msgstr "步骤 2 的准备工作"
 #: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:285
 msgid ""
 "Before Step 2, send a fully random Dataset B to Instance 1. Due to the "
-"unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy,"
-" Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's"
-" cache only in Node 1's DRAM."
+"unified on-chip memory/DRAM KV Cache with LRU (Least Recently Used) "
+"eviction policy, Dataset B's cache evicts Dataset A's cache from on-chip "
+"memory, leaving Dataset A's cache only in Node 1's DRAM."
 msgstr "在步骤2之前,向实例1发送一个完全随机的数据集B。由于采用了具有LRU(最近最少使用)淘汰策略的统一HBM/DRAM KV缓存,数据集B的缓存会将数据集A的缓存从HBM中淘汰,使得数据集A的缓存仅保留在节点1的DRAM中。"

 #: ../../source/tutorials/features/pd_colocated_mooncake_multi_instance.md:290
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -40,7 +40,7 @@ msgid ""
 "servers to deploy the \"2P1D\" architecture. Assume the IP of the "
 "prefiller server is 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and "
 "the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). "
-"On each server, use 8 NPUs 16 chips to deploy one service instance."
+"On each server, use 8 NPUs and 16 chips to deploy one service instance."
 msgstr ""
 "以 Deepseek-r1-w8a8 模型为例,使用 4 台 Atlas 800T A3 服务器部署 \"2P1D\" 架构。假设预填充服务器 "
 "IP 为 192.0.0.1(预填充节点 1)和 192.0.0.2(预填充节点 2),解码服务器 IP 为 192.0.0.3(解码节点 1)和"
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -30,16 +30,17 @@ msgstr "开始使用"
 #: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:5
 msgid ""
 "vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide "
-"takes one-by-one steps to verify these features with constrained "
-"resources."
-msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南将逐步引导您在有限资源下验证这些功能。"
+"provides step-by-step instructions to verify this features in resource-"
+"constrained environments."
+msgstr "vLLM-Ascend 现已支持预填充-解码 (PD) 解耦架构。本指南提供逐步说明,帮助您在资源受限的环境中验证这些功能。"

 #: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:7
 msgid ""
-"Using the Qwen2.5-VL-7B-Instruct model as an example, use vLLM-Ascend "
+"Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend "
 "v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "
-"\"1P1D\" architecture. Assume the IP address is 192.0.0.1."
-msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vLLM-Ascend v0.11.0rc1 (包含 vLLM v0.11.0) 部署 \"1P1D\" 架构。假设 IP 地址为 192.0.0.1。"
+"\"1P1D\" architecture (one Prefiller and one Decoder on the same node). "
+"Assume the IP address is 192.0.0.1."
+msgstr "以 Qwen2.5-VL-7B-Instruct 模型为例,在 1 台 Atlas 800T A2 服务器上使用 vllm-ascend v0.11.0rc1(包含 vLLM v0.11.0)部署 \"1P1D\" 架构(同一节点上一个预填充器和一个解码器)。假设 IP 地址为 192.0.0.1。"

 #: ../../source/tutorials/features/pd_disaggregation_mooncake_single_node.md:9
 msgid "Verify Communication Environment"
|
||||
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -32,32 +32,25 @@ msgid ""
|
||||
"DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-"
|
||||
"thinking mode. Compared to the previous version, this upgrade brings "
|
||||
"improvements in multiple aspects:"
|
||||
msgstr ""
|
||||
"DeepSeek-V3.1 是一个支持思考模式和非思考模式的混合模型。与前一版本相比,此"
|
||||
"次升级在多个方面带来了改进:"
|
||||
msgstr "DeepSeek-V3.1 是一个支持思考模式和非思考模式的混合模型。与前一版本相比,此次升级在多个方面带来了改进:"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:7
|
||||
msgid ""
|
||||
"Hybrid thinking mode: One model supports both thinking mode and non-"
|
||||
"thinking mode by changing the chat template."
|
||||
msgstr ""
|
||||
"混合思考模式:一个模型通过更改聊天模板,同时支持思考模式和非思考模式。"
|
||||
msgstr "混合思考模式:一个模型通过更改聊天模板,同时支持思考模式和非思考模式。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:9
|
||||
msgid ""
|
||||
"Smarter tool calling: Through post-training optimization, the model's "
|
||||
"performance in tool usage and agent tasks has significantly improved."
|
||||
msgstr ""
|
||||
"更智能的工具调用:通过后训练优化,模型在工具使用和智能体任务方面的性能显著提"
|
||||
"升。"
|
||||
msgstr "更智能的工具调用:通过后训练优化,模型在工具使用和智能体任务方面的性能显著提升。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:11
|
||||
msgid ""
|
||||
"Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable "
|
||||
"answer quality to DeepSeek-R1-0528, while responding more quickly."
|
||||
msgstr ""
|
||||
"更高的思考效率:DeepSeek-V3.1-Think 实现了与 DeepSeek-R1-0528 相当的答案质"
|
||||
"量,同时响应速度更快。"
|
||||
msgstr "更高的思考效率:DeepSeek-V3.1-Think 实现了与 DeepSeek-R1-0528 相当的答案质量,同时响应速度更快。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:13
|
||||
msgid "The `DeepSeek-V3.1` model is first supported in `vllm-ascend:v0.9.1rc3`."
|
||||
@@ -69,9 +62,7 @@ msgid ""
|
||||
"including supported features, feature configuration, environment "
|
||||
"preparation, single-node and multi-node deployment, accuracy and "
|
||||
"performance evaluation."
|
||||
msgstr ""
|
||||
"本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点"
|
||||
"和多节点部署、精度和性能评估。"
|
||||
msgstr "本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:17
|
||||
msgid "Supported Features"
|
||||
@@ -90,9 +81,7 @@ msgstr ""
|
||||
msgid ""
|
||||
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
|
||||
"the feature's configuration."
|
||||
msgstr ""
|
||||
"请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配"
|
||||
"置。"
|
||||
msgstr "请参考 [特性指南](../../user_guide/feature_guide/index.md) 以获取特性的配置。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:23
|
||||
msgid "Environment Preparation"
|
||||
@@ -107,8 +96,8 @@ msgid ""
|
||||
"`DeepSeek-V3.1`(BF16 version): [Download model "
|
||||
"weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1)."
|
||||
msgstr ""
|
||||
"`DeepSeek-V3.1`(BF16 版本):[下载模型权重](https://www.modelscope.cn/"
|
||||
"models/deepseek-ai/DeepSeek-V3.1)。"
|
||||
"`DeepSeek-V3.1`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models"
|
||||
"/deepseek-ai/DeepSeek-V3.1)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:28
|
||||
msgid ""
|
||||
@@ -116,9 +105,9 @@ msgid ""
|
||||
"[Download model weight](https://www.modelscope.cn/models/Eco-"
|
||||
"Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot)."
|
||||
msgstr ""
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`(混合 MTP 量化版本):[下载模型权重]"
|
||||
"(https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-"
|
||||
"QuaRot)。"
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`(混合 MTP "
|
||||
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-"
|
||||
"Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:29
|
||||
msgid ""
|
||||
@@ -126,9 +115,9 @@ msgid ""
|
||||
" [Download model weight](https://www.modelscope.cn/models/Eco-"
|
||||
"Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot)."
|
||||
msgstr ""
|
||||
"`DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(混合 MTP 量化版本):[下载模型权"
|
||||
"重](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-"
|
||||
"mtp-QuaRot)。"
|
||||
"`DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(混合 MTP "
|
||||
"量化版本):[下载模型权重](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1"
|
||||
"-Terminus-w4a8-mtp-QuaRot)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:30
|
||||
#, python-format
|
||||
@@ -137,8 +126,7 @@ msgid ""
|
||||
"[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)."
|
||||
" You can use this method to quantize the model."
|
||||
msgstr ""
|
||||
"`量化方法`:"
|
||||
"[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)。"
|
||||
"`量化方法`:[msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96)。"
|
||||
" 您可以使用此方法对模型进行量化。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:32
|
||||
@@ -157,8 +145,8 @@ msgid ""
|
||||
"node communication according to [verify multi-node communication "
|
||||
"environment](../../installation.md#verify-multi-node-communication)."
|
||||
msgstr ""
|
||||
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation."
|
||||
"md#verify-multi-node-communication) 验证多节点通信。"
|
||||
"如果您想部署多节点环境,需要根据 [验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||
"communication) 验证多节点通信。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:38
|
||||
msgid "Installation"
|
||||
@@ -174,8 +162,8 @@ msgid ""
|
||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||
"docker)."
|
||||
msgstr ""
|
||||
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考 [使用 docker]"
|
||||
"(../../installation.md#set-up-using-docker)。"
|
||||
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考 [使用 docker](../../installation.md#set-"
|
||||
"up-using-docker)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:80
|
||||
msgid ""
|
||||
@@ -195,9 +183,7 @@ msgstr "单节点部署"
|
||||
msgid ""
|
||||
"Quantized model `DeepSeek-V3.1-w8a8-mtp-QuaRot` can be deployed on 1 "
|
||||
"Atlas 800 A3 (64G × 16)."
|
||||
msgstr ""
|
||||
"量化模型 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 可以部署在 1 台 Atlas 800 A3 "
|
||||
"(64G × 16)上。"
|
||||
msgstr "量化模型 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 可以部署在 1 台 Atlas 800 A3 (64G × 16)上。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:88
|
||||
msgid "Run the following script to execute online inference."
|
||||
@@ -215,9 +201,8 @@ msgid ""
|
||||
" Furthermore, enabling this feature is not recommended in scenarios where"
|
||||
" PD is separated."
|
||||
msgstr ""
|
||||
"设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于"
|
||||
"在 v1 调度器中提高输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下"
|
||||
"降。此外,在 PD 分离的场景中不建议启用此功能。"
|
||||
"设置环境变量 `VLLM_ASCEND_BALANCE_SCHEDULING=1` 启用均衡调度。这可能有助于在 v1 "
|
||||
"调度器中提高输出吞吐量并降低 TPOT。然而,在某些场景下 TTFT 可能会下降。此外,在 PD 分离的场景中不建议启用此功能。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:135
|
||||
msgid ""
|
||||
@@ -233,24 +218,20 @@ msgid ""
|
||||
"`16384` is sufficient, however, for precision testing, please set it at "
|
||||
"least `35000`."
|
||||
msgstr ""
|
||||
"`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌之和。对于输"
|
||||
"入长度为 3.5K 和输出长度为 1.5K 的性能测试,`16384` 的值就足够了,但是,对于"
|
||||
"精度测试,请至少将其设置为 `35000`。"
|
||||
"`--max-model-len` 指定最大上下文长度——即单个请求的输入和输出令牌之和。对于输入长度为 3.5K 和输出长度为 1.5K "
|
||||
"的性能测试,`16384` 的值就足够了,但是,对于精度测试,请至少将其设置为 `35000`。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:137
|
||||
msgid ""
|
||||
"`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
|
||||
"To enable it, remove this option."
|
||||
msgstr ""
|
||||
"`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"
|
||||
msgstr "`--no-enable-prefix-caching` 表示前缀缓存被禁用。要启用它,请移除此选项。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:138
|
||||
msgid ""
|
||||
"If you use the w4a8 weight, more memory will be allocated to kvcache, and"
|
||||
" you can try to increase system throughput to achieve greater throughput."
|
||||
msgstr ""
|
||||
"如果使用 w4a8 权重,将分配更多内存给 kvcache,您可以尝试增加系统吞吐量以实现"
|
||||
"更大的吞吐量。"
|
||||
msgstr "如果使用 w4a8 权重,将分配更多内存给 kvcache,您可以尝试增加系统吞吐量以实现更大的吞吐量。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:140
|
||||
msgid "Multi-node Deployment"
|
||||
@@ -260,8 +241,7 @@ msgstr "多节点部署"
|
||||
msgid ""
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`: require at least 2 Atlas 800 A2 (64G × "
|
||||
"8)."
|
||||
msgstr ""
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot`:需要至少 2 台 Atlas 800 A2(64G × 8)。"
|
||||
msgstr "`DeepSeek-V3.1-w8a8-mtp-QuaRot`:需要至少 2 台 Atlas 800 A2(64G × 8)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:144
|
||||
msgid "Run the following scripts on two nodes respectively."
|
||||
@@ -284,8 +264,8 @@ msgid ""
|
||||
"We recommend using Mooncake for deployment: "
|
||||
"[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
|
||||
msgstr ""
|
||||
"我们建议使用 Mooncake 进行部署:[Mooncake](../features/"
|
||||
"pd_disaggregation_mooncake_multi_node.md)。"
|
||||
"我们建议使用 Mooncake "
|
||||
"进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:256
|
||||
msgid ""
|
||||
@@ -293,27 +273,27 @@ msgid ""
|
||||
"nodes) rather than 1P1D (2 nodes), because there is no enough NPU memory "
|
||||
"to serve high concurrency in 1P1D case."
|
||||
msgstr ""
|
||||
"以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D"
|
||||
"(2 个节点),因为在 1P1D 情况下没有足够的 NPU 内存来服务高并发。"
|
||||
"以 Atlas 800 A3(64G × 16)为例,我们建议部署 2P1D(4 个节点)而不是 1P1D(2 个节点),因为在 1P1D "
|
||||
"情况下没有足够的 NPU 内存来服务高并发。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:258
|
||||
msgid ""
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` require 4 Atlas 800 A3 "
|
||||
"(64G × 16)."
|
||||
msgstr ""
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` 需要 4 台 Atlas 800 A3 "
|
||||
"(64G × 16)。"
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` 需要 4 台 Atlas 800 A3 (64G ×"
|
||||
" 16)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:260
|
||||
msgid ""
|
||||
"To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
|
||||
"to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` "
|
||||
"to deploy a `launch_online_dp.py` script and a `run_dp_template.sh` "
|
||||
"script on each node and deploy a `proxy.sh` script on prefill master node"
|
||||
" to forward requests."
|
||||
msgstr ""
|
||||
"要运行 vllm-ascend `Prefill-Decode 解耦`服务,您需要在每个节点上部署一个 "
|
||||
"`launch_dp_program.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill "
|
||||
"主节点上部署一个 `proxy.sh` 脚本来转发请求。"
|
||||
"`launch_online_dp.py` 脚本和一个 `run_dp_template.sh` 脚本,并在 prefill 主节点上部署一个 "
|
||||
"`proxy.sh` 脚本来转发请求。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:262
|
||||
msgid ""
|
||||
@@ -321,9 +301,9 @@ msgid ""
|
||||
"[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||
msgstr ""
|
||||
"`launch_online_dp.py` 用于启动外部 dp vllm 服务器。[launch\\_online\\_dp."
|
||||
"py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/"
|
||||
"external_online_dp/launch_online_dp.py)"
|
||||
"`launch_online_dp.py` 用于启动外部 dp vllm "
|
||||
"服务器。[launch\\_online\\_dp.py](https://github.com/vllm-project/vllm-"
|
||||
"ascend/blob/main/examples/external_online_dp/launch_online_dp.py)"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:265
|
||||
msgid "Prefill Node 0 `run_dp_template.sh` script"
|
||||
@@ -358,8 +338,8 @@ msgid ""
|
||||
"Prefill-Decode (PD) separation scenario, enable MLAPO only on decode "
|
||||
"nodes."
|
||||
msgstr ""
|
||||
"`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 "
|
||||
"NPU 内存。在 Prefill-Decode (PD) 分离场景中,仅在 decode 节点上启用 MLAPO。"
|
||||
"`VLLM_ASCEND_ENABLE_MLAPO=1`:启用融合算子,这可以显著提高性能但会消耗更多 NPU 内存。在 Prefill-"
|
||||
"Decode (PD) 分离场景中,仅在 decode 节点上启用 MLAPO。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:576
|
||||
msgid ""
|
||||
@@ -367,9 +347,7 @@ msgid ""
|
||||
"Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of "
|
||||
"operator delivery can be implemented to overlap the operator delivery "
|
||||
"latency."
|
||||
msgstr ""
|
||||
"`--async-scheduling`:启用异步调度功能。当启用多令牌预测 (MTP) 时,可以实现算"
|
||||
"子交付的异步调度,以重叠算子交付延迟。"
|
||||
msgstr "`--async-scheduling`:启用异步调度功能。当启用多令牌预测 (MTP) 时,可以实现算子交付的异步调度,以重叠算子交付延迟。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:577
|
||||
msgid ""
|
||||
@@ -378,9 +356,8 @@ msgid ""
|
||||
"it is recommended to set them to the number of frequently occurring "
|
||||
"requests on the Decode (D) node."
|
||||
msgstr ""
|
||||
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大"
|
||||
"值为 `n = max-num-seqs`。对于其他值,建议将其设置为 Decode (D) 节点上频繁出"
|
||||
"现的请求数量。"
|
||||
"`cudagraph_capture_sizes`:推荐值为 `n x (mtp + 1)`。最小值为 `n = 1`,最大值为 `n = "
|
||||
"max-num-seqs`。对于其他值,建议将其设置为 Decode (D) 节点上频繁出现的请求数量。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:578
|
||||
msgid ""
|
||||
@@ -390,9 +367,9 @@ msgid ""
|
||||
"the PD separation scenario, it is recommended to enable this "
|
||||
"configuration on both prefill and decode nodes simultaneously."
|
||||
msgstr ""
|
||||
"`recompute_scheduler_enable: true`:启用重计算调度器。当 decode 节点的键值缓"
|
||||
"存 (KV Cache) 不足时,请求将被发送到 prefill 节点以重新计算 KV Cache。在 PD "
|
||||
"分离场景中,建议同时在 prefill 和 decode 节点上启用此配置。"
|
||||
"`recompute_scheduler_enable: true`:启用重计算调度器。当 decode 节点的键值缓存 (KV Cache) "
|
||||
"不足时,请求将被发送到 prefill 节点以重新计算 KV Cache。在 PD 分离场景中,建议同时在 prefill 和 decode "
|
||||
"节点上启用此配置。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:579
|
||||
msgid ""
|
||||
@@ -402,8 +379,7 @@ msgid ""
|
||||
"improved efficiency."
|
||||
msgstr ""
|
||||
"`multistream_overlap_shared_expert: true`:当张量并行 (TP) 大小为 1 或 "
|
||||
"`enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程,以"
|
||||
"提高效率。"
|
||||
"`enable_shared_expert_dp: true` 时,启用额外的流来重叠共享专家的计算过程,以提高效率。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:580
|
||||
msgid ""
|
||||
@@ -412,9 +388,8 @@ msgid ""
|
||||
"embedding layer to be greater than 1, which is used to reduce the "
|
||||
"computational load of each card on the LMHead embedding layer."
|
||||
msgstr ""
|
||||
"`lmhead_tensor_parallel_size: 16`:当 decode 节点的张量并行 (TP) 大小为 1 "
|
||||
"时,此参数允许 LMHead 嵌入层的 TP 大小大于 1,用于减少每张卡在 LMHead 嵌入层"
|
||||
"上的计算负载。"
|
||||
"`lmhead_tensor_parallel_size: 16`:当 decode 节点的张量并行 (TP) 大小为 1 时,此参数允许 "
|
||||
"LMHead 嵌入层的 TP 大小大于 1,用于减少每张卡在 LMHead 嵌入层上的计算负载。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:582
|
||||
msgid "run server for each node:"
|
||||
@@ -431,7 +406,10 @@ msgid ""
|
||||
"[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
|
||||
"project/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr "在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr ""
|
||||
"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
|
||||
"/vllm-project/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:653
|
||||
msgid "Functional Verification"
|
||||
@@ -466,7 +444,9 @@ msgid ""
|
||||
"After execution, you can get the result, here is the result of "
|
||||
"`DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1` for reference "
|
||||
"only."
|
||||
msgstr "执行后,您可以获得结果。以下是 `vllm-ascend:0.11.0rc1` 中 `DeepSeek-V3.1-w8a8-mtp-QuaRot` 的结果,仅供参考。"
|
||||
msgstr ""
|
||||
"执行后,您可以获得结果。以下是 `vllm-ascend:0.11.0rc1` 中 `DeepSeek-V3.1-w8a8-mtp-QuaRot`"
|
||||
" 的结果,仅供参考。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:44
|
||||
msgid "dataset"
|
||||
@@ -541,7 +521,10 @@ msgid ""
|
||||
"Refer to [Using AISBench for performance "
|
||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation) for details."
|
||||
msgstr "详情请参考[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
||||
msgstr ""
|
||||
"详情请参考[使用 AISBench "
|
||||
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation)。"
|
||||
|
||||
#: ../../source/tutorials/models/DeepSeek-V3.1.md:693
|
||||
msgid "The performance result is:"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -74,41 +74,56 @@ msgstr "模型权重"
|
||||
msgid ""
|
||||
"`GLM-4.5`(BF16 version): [Download model "
|
||||
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)."
|
||||
msgstr "`GLM-4.5`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)。"
|
||||
msgstr ""
|
||||
"`GLM-4.5`(BF16 "
|
||||
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.5)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:22
|
||||
msgid ""
|
||||
"`GLM-4.6`(BF16 version): [Download model "
|
||||
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)."
|
||||
msgstr "`GLM-4.6`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)。"
|
||||
msgstr ""
|
||||
"`GLM-4.6`(BF16 "
|
||||
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.6)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:23
|
||||
msgid ""
|
||||
"`GLM-4.7`(BF16 version): [Download model "
|
||||
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)."
|
||||
msgstr "`GLM-4.7`(BF16 版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)。"
|
||||
msgstr ""
|
||||
"`GLM-4.7`(BF16 "
|
||||
"版本):[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-4.7)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:24
|
||||
msgid ""
|
||||
"`GLM-4.5-w8a8-with-float-mtp`(Quantized version with mtp): [Download "
|
||||
"model weight](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)."
|
||||
msgstr "`GLM-4.5-w8a8-with-float-mtp`(带 mtp 的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)。"
|
||||
msgstr ""
|
||||
"`GLM-4.5-w8a8-with-float-mtp`(带 mtp "
|
||||
"的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.5-w8a8)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:25
|
||||
msgid ""
|
||||
"`GLM-4.6-w8a8`(Quantized version without mtp): [Download model "
|
||||
"weight](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8). Because "
|
||||
"vllm do not support GLM4.6 mtp in October, so we do not provide mtp "
|
||||
"version. And last month, it supported, you can use the following "
|
||||
"quantization scheme to add mtp weights to Quantized weights."
|
||||
msgstr "`GLM-4.6-w8a8`(不带 mtp 的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8)。由于 vllm 在十月份不支持 GLM4.6 的 mtp,因此我们不提供 mtp 版本。上个月已支持,您可以使用以下量化方案将 mtp 权重添加到量化权重中。"
|
||||
"vllm does not support GLM4.6 mtp in October, we do not provide an mtp "
|
||||
"version. Last month, it was supported; you can use the following "
|
||||
"quantization scheme to add mtp weights to the quantized weights."
|
||||
msgstr ""
|
||||
"`GLM-4.6-w8a8`(不带 mtp "
|
||||
"的量化版本):[下载模型权重](https://modelers.cn/models/Modelers_Park/GLM-4.6-w8a8)。由于"
|
||||
" vllm 在十月份不支持 GLM4.6 的 mtp,因此我们不提供 mtp 版本。上个月已支持,您可以使用以下量化方案将 mtp "
|
||||
"权重添加到量化权重中。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:26
|
||||
msgid ""
|
||||
"`GLM-4.7-w8a8-with-float-mtp`(Quantized version without mtp): [Download "
|
||||
"model weight](https://modelscope.cn/models/Eco-"
|
||||
"Tech/GLM-4.7-W8A8-floatmtp)."
|
||||
msgstr "`GLM-4.7-w8a8-with-float-mtp`(不带 mtp 的量化版本):[下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8-floatmtp)。"
|
||||
msgstr ""
|
||||
"`GLM-4.7-w8a8-with-float-mtp`(不带 mtp "
|
||||
"的量化版本):[下载模型权重](https://modelscope.cn/models/Eco-"
|
||||
"Tech/GLM-4.7-W8A8-floatmtp)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:27
|
||||
msgid ""
|
||||
@@ -136,14 +151,17 @@ msgid "A3 series"
|
||||
msgstr "A3 系列"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:42
|
||||
#: ../../source/tutorials/models/GLM4.x.md:85
|
||||
msgid "Start the docker image on your each node."
|
||||
msgstr "在您的每个节点上启动 docker 镜像。"
|
||||
msgid "Start the docker image on each node."
|
||||
msgstr "在每个节点上启动 docker 镜像。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md
|
||||
msgid "A2 series"
|
||||
msgstr "A2 系列"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:85
|
||||
msgid "Start the docker image on your each node."
|
||||
msgstr "在每个节点上启动 docker 镜像。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:118
|
||||
msgid ""
|
||||
"In addition, if you don't want to use the docker image as above, you can "
|
||||
@@ -180,7 +198,12 @@ msgid ""
|
||||
"The optimization of the FIA operator will be enabled by default in CANN "
|
||||
"9.x releases, and manual replacement will no longer be required. Please "
|
||||
"stay tuned for updates to this document."
|
||||
msgstr "我们已在 CANN 8.5.1 中优化了 FIA 算子。需要手动替换与 FIA 算子相关的文件。请执行 FIA 算子替换脚本:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) 和 [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)。FIA 算子的优化将在 CANN 9.x 版本中默认启用,届时将不再需要手动替换。请关注本文档的更新。"
|
||||
msgstr ""
|
||||
"我们已在 CANN 8.5.1 中优化了 FIA 算子。需要手动替换与 FIA 算子相关的文件。请执行 FIA "
|
||||
"算子替换脚本:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh)"
|
||||
" 和 "
|
||||
"[A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)。FIA"
|
||||
" 算子的优化将在 CANN 9.x 版本中默认启用,届时将不再需要手动替换。请关注本文档的更新。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:132
|
||||
msgid "Single-node Deployment"
|
||||
@@ -194,144 +217,155 @@ msgstr "在低延迟场景下,我们推荐单机部署。"
|
||||
msgid ""
|
||||
"Quantized model `glm4.7_w8a8_with_float_mtp` can be deployed on 1 Atlas "
|
||||
"800 A3 (64G × 16) or 1 Atlas 800 A2 (64G × 8)."
|
||||
msgstr "量化模型 `glm4.7_w8a8_with_float_mtp` 可以部署在 1 台 Atlas 800 A3(64G × 16)或 1 台 Atlas 800 A2(64G × 8)上。"
|
||||
msgstr ""
|
||||
"量化模型 `glm4.7_w8a8_with_float_mtp` 可以部署在 1 台 Atlas 800 A3(64G × 16)或 1 台 "
|
||||
"Atlas 800 A2(64G × 8)上。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:137
|
||||
msgid "Run the following script to execute online inference."
|
||||
msgstr "运行以下脚本以执行在线推理。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:169
|
||||
#: ../../source/tutorials/models/GLM4.x.md:168
|
||||
msgid "**Notice:** The parameters are explained as follows:"
|
||||
msgstr "**注意:** 参数解释如下:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:172
|
||||
#: ../../source/tutorials/models/GLM4.x.md:171
|
||||
msgid ""
|
||||
"`--async-scheduling` Asynchronous scheduling is a technique used to "
|
||||
"optimize inference efficiency. It allows non-blocking task scheduling to "
|
||||
"improve concurrency and throughput, especially when processing large-"
|
||||
"scale models."
|
||||
msgstr "`--async-scheduling` 异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,特别是在处理大规模模型时。"
|
||||
msgstr ""
|
||||
"`--async-scheduling` "
|
||||
"异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,特别是在处理大规模模型时。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:173
|
||||
#: ../../source/tutorials/models/GLM4.x.md:172
|
||||
msgid ""
|
||||
"`fusion_ops_gmmswigluquant` The performance of the GmmSwigluQuant fusion "
|
||||
"operator tends to degrade when the total number of NPUs is ≤ 16."
|
||||
msgstr "`fusion_ops_gmmswigluquant` 当 NPU 总数 ≤ 16 时,GmmSwigluQuant 融合算子的性能往往会下降。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:175
|
||||
#: ../../source/tutorials/models/GLM4.x.md:174
|
||||
msgid "Multi-node Deployment"
|
||||
msgstr "多节点部署"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:177
|
||||
#: ../../source/tutorials/models/GLM4.x.md:176
|
||||
msgid ""
|
||||
"Although the former tutorial said \"Not recommended to deploy multi-node "
|
||||
"on Atlas 800 A2 (64G × 8)\", but if you insist to deploy GLM-4.x model on"
|
||||
" multi-node like 2 × Atlas 800 A2 (64G × 8), run the following scripts on"
|
||||
" two nodes respectively."
|
||||
msgstr "尽管之前的教程提到“不建议在 Atlas 800 A2(64G × 8)上部署多节点”,但如果您坚持要在类似 2 × Atlas 800 A2(64G × 8)的多节点上部署 GLM-4.x 模型,请分别在两个节点上运行以下脚本。"
|
||||
msgstr ""
|
||||
"尽管之前的教程提到“不建议在 Atlas 800 A2(64G × 8)上部署多节点”,但如果您坚持要在类似 2 × Atlas 800 "
|
||||
"A2(64G × 8)的多节点上部署 GLM-4.x 模型,请分别在两个节点上运行以下脚本。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:179
|
||||
#: ../../source/tutorials/models/GLM4.x.md:178
|
||||
msgid "**Node 0**"
|
||||
msgstr "**节点 0**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:230
|
||||
#: ../../source/tutorials/models/GLM4.x.md:228
|
||||
msgid "**Node 1**"
|
||||
msgstr "**节点 1**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:283
|
||||
#: ../../source/tutorials/models/GLM4.x.md:280
|
||||
msgid "Prefill-Decode Disaggregation"
|
||||
msgstr "Prefill-Decode 解耦部署"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:285
|
||||
#: ../../source/tutorials/models/GLM4.x.md:282
|
||||
msgid ""
|
||||
"We'd like to show the deployment guide of `GLM4.7` on multi-node "
|
||||
"environment with 2P1D for better performance."
|
||||
msgstr "我们将展示 `GLM4.7` 在多节点环境(2P1D)下的部署指南,以获得更好的性能。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:287
|
||||
#: ../../source/tutorials/models/GLM4.x.md:284
|
||||
msgid "Before you start, please"
|
||||
msgstr "在开始之前,请"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:289
|
||||
#: ../../source/tutorials/models/GLM4.x.md:286
|
||||
msgid "prepare the script `launch_online_dp.py` on each node:"
|
||||
msgstr "在每个节点上准备脚本 `launch_online_dp.py`:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:392
|
||||
#: ../../source/tutorials/models/GLM4.x.md:389
|
||||
msgid "prepare the script `run_dp_template.sh` on each node."
|
||||
msgstr "在每个节点上准备脚本 `run_dp_template.sh`。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:394
|
||||
#: ../../source/tutorials/models/GLM4.x.md:669
|
||||
#: ../../source/tutorials/models/GLM4.x.md:391
|
||||
#: ../../source/tutorials/models/GLM4.x.md:664
|
||||
msgid "Prefill node 0"
|
||||
msgstr "Prefill 节点 0"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:460
|
||||
#: ../../source/tutorials/models/GLM4.x.md:676
|
||||
#: ../../source/tutorials/models/GLM4.x.md:456
|
||||
#: ../../source/tutorials/models/GLM4.x.md:671
|
||||
msgid "Prefill node 1"
|
||||
msgstr "Prefill 节点 1"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:525
|
||||
#: ../../source/tutorials/models/GLM4.x.md:683
|
||||
#: ../../source/tutorials/models/GLM4.x.md:520
|
||||
#: ../../source/tutorials/models/GLM4.x.md:678
|
||||
msgid "Decode node 0"
|
||||
msgstr "Decode 节点 0"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:596
|
||||
#: ../../source/tutorials/models/GLM4.x.md:690
|
||||
#: ../../source/tutorials/models/GLM4.x.md:591
|
||||
#: ../../source/tutorials/models/GLM4.x.md:685
|
||||
msgid "Decode node 1"
|
||||
msgstr "Decode 节点 1"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:667
|
||||
#: ../../source/tutorials/models/GLM4.x.md:662
|
||||
msgid ""
|
||||
"Once the preparation is done, you can start the server with the following"
|
||||
" command on each node:"
|
||||
msgstr "准备工作完成后,您可以在每个节点上使用以下命令启动服务器:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:697
|
||||
#: ../../source/tutorials/models/GLM4.x.md:692
|
||||
msgid "Request Forwarding"
|
||||
msgstr "请求转发"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:699
|
||||
#: ../../source/tutorials/models/GLM4.x.md:694
|
||||
msgid ""
|
||||
"To set up request forwarding, run the following script on any machine. "
|
||||
"You can get the proxy program in the repository's examples: "
|
||||
"[load_balance_proxy_server_example.py](https://github.com/vllm-project"
|
||||
"/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr "要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
msgstr ""
|
||||
"要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
|
||||
"/vllm-project/vllm-"
|
||||
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:728
|
||||
#: ../../source/tutorials/models/GLM4.x.md:723
|
||||
msgid "Functional Verification"
|
||||
msgstr "功能验证"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:730
|
||||
#: ../../source/tutorials/models/GLM4.x.md:725
|
||||
msgid "Once your server is started, you can query the model with input prompts:"
|
||||
msgstr "服务器启动后,您可以使用输入提示词查询模型:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:749
|
||||
#: ../../source/tutorials/models/GLM4.x.md:744
|
||||
msgid "Accuracy Evaluation"
|
||||
msgstr "精度评估"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:751
|
||||
#: ../../source/tutorials/models/GLM4.x.md:746
|
||||
msgid "Here are two accuracy evaluation methods."
|
||||
msgstr "这里有两种精度评估方法。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:753
|
||||
#: ../../source/tutorials/models/GLM4.x.md:770
|
||||
#: ../../source/tutorials/models/GLM4.x.md:748
|
||||
#: ../../source/tutorials/models/GLM4.x.md:765
|
||||
msgid "Using AISBench"
|
||||
msgstr "使用 AISBench"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:755
|
||||
#: ../../source/tutorials/models/GLM4.x.md:750
|
||||
msgid ""
|
||||
"Refer to [Using "
|
||||
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
|
||||
"details."
|
||||
msgstr "详情请参考[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:757
|
||||
#: ../../source/tutorials/models/GLM4.x.md:752
|
||||
msgid ""
|
||||
"After execution, you can get the result, here is the result of `GLM4.7` "
|
||||
"in `vllm-ascend:main` (after `vllm-ascend:0.14.0rc1`) for reference only."
|
||||
msgstr "执行后,您可以获得结果,以下是 `GLM4.7` 在 `vllm-ascend:main`(`vllm-ascend:0.14.0rc1` 之后)中的结果,仅供参考。"
|
||||
msgstr ""
|
||||
"执行后,您可以获得结果,以下是 `GLM4.7` 在 `vllm-ascend:main`(`vllm-ascend:0.14.0rc1` "
|
||||
"之后)中的结果,仅供参考。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:87
|
||||
msgid "dataset"
|
||||
@@ -389,111 +423,111 @@ msgstr "MATH500"
|
||||
msgid "98.8"
|
||||
msgstr "98.8"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:764
|
||||
#: ../../source/tutorials/models/GLM4.x.md:759
|
||||
msgid "Using Language Model Evaluation Harness"
|
||||
msgstr "使用语言模型评估工具"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:766
|
||||
#: ../../source/tutorials/models/GLM4.x.md:761
|
||||
msgid "Not tested yet."
|
||||
msgstr "尚未测试。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:768
|
||||
#: ../../source/tutorials/models/GLM4.x.md:763
|
||||
msgid "Performance"
|
||||
msgstr "性能"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:772
|
||||
#: ../../source/tutorials/models/GLM4.x.md:767
|
||||
msgid ""
|
||||
"Refer to [Using AISBench for performance "
|
||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation) for details."
|
||||
msgstr ""
|
||||
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
||||
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md"
|
||||
"#execute-performance-evaluation)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:774
|
||||
#: ../../source/tutorials/models/GLM4.x.md:769
|
||||
msgid "Using vLLM Benchmark"
|
||||
msgstr "使用vLLM基准测试"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:776
|
||||
#: ../../source/tutorials/models/GLM4.x.md:771
|
||||
msgid "Run performance evaluation of `GLM-4.x` as an example."
|
||||
msgstr "以运行 `GLM-4.x` 的性能评估为例。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:778
|
||||
#: ../../source/tutorials/models/GLM4.x.md:773
|
||||
msgid ""
|
||||
"Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) "
|
||||
"for more details."
|
||||
msgstr ""
|
||||
"更多详情请参考 [vllm基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||
msgstr "更多详情请参考 [vllm基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:780
|
||||
#: ../../source/tutorials/models/GLM4.x.md:775
|
||||
msgid "There are three `vllm bench` subcommands:"
|
||||
msgstr "`vllm bench` 包含三个子命令:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:782
|
||||
#: ../../source/tutorials/models/GLM4.x.md:777
|
||||
msgid "`latency`: Benchmark the latency of a single batch of requests."
|
||||
msgstr "`latency`:基准测试单批次请求的延迟。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:783
|
||||
#: ../../source/tutorials/models/GLM4.x.md:778
|
||||
msgid "`serve`: Benchmark the online serving throughput."
|
||||
msgstr "`serve`:基准测试在线服务吞吐量。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:784
|
||||
#: ../../source/tutorials/models/GLM4.x.md:779
|
||||
msgid "`throughput`: Benchmark offline inference throughput."
|
||||
msgstr "`throughput`:基准测试离线推理吞吐量。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:786
|
||||
#: ../../source/tutorials/models/GLM4.x.md:781
|
||||
msgid "Take the `serve` as an example. Run the code as follows."
|
||||
msgstr "以 `serve` 为例,运行以下代码。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:808
|
||||
#: ../../source/tutorials/models/GLM4.x.md:803
|
||||
msgid ""
|
||||
"After about several minutes, you can get the performance evaluation "
|
||||
"result."
|
||||
msgstr "大约几分钟后,您将获得性能评估结果。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:810
|
||||
#: ../../source/tutorials/models/GLM4.x.md:805
|
||||
msgid "Best Practices"
|
||||
msgstr "最佳实践"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:812
|
||||
#: ../../source/tutorials/models/GLM4.x.md:807
|
||||
msgid "In this chapter, we recommend best practices for three scenarios:"
|
||||
msgstr "本章节,我们针对三种场景推荐最佳实践:"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:814
|
||||
#: ../../source/tutorials/models/GLM4.x.md:809
|
||||
msgid ""
|
||||
"Long-context: For long sequences with low concurrency (≤ 4): set `dp1 "
|
||||
"tp16`; For long sequences with high concurrency (> 4): set `dp2 tp8`"
|
||||
msgstr ""
|
||||
"长上下文:对于低并发(≤ 4)的长序列,设置 `dp1 tp16`;对于高并发(> 4)的长序列,设置 `dp2 tp8`"
|
||||
msgstr "长上下文:对于低并发(≤ 4)的长序列,设置 `dp1 tp16`;对于高并发(> 4)的长序列,设置 `dp2 tp8`"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:815
|
||||
#: ../../source/tutorials/models/GLM4.x.md:810
|
||||
msgid ""
|
||||
"Low-latency: For short sequences with low latency: we recommend setting "
|
||||
"`dp2 tp8`"
|
||||
msgstr "低延迟:对于需要低延迟的短序列,我们推荐设置 `dp2 tp8`"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:816
|
||||
#: ../../source/tutorials/models/GLM4.x.md:811
|
||||
msgid ""
|
||||
"High-throughput: For short sequences with high throughput: we also "
|
||||
"recommend setting `dp2 tp8`"
|
||||
msgstr "高吞吐量:对于需要高吞吐量的短序列,我们也推荐设置 `dp2 tp8`"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:818
|
||||
#: ../../source/tutorials/models/GLM4.x.md:813
|
||||
msgid ""
|
||||
"**Notice:** `max-model-len` and `max-num-seqs` need to be set according "
|
||||
"to the actual usage scenario. For other settings, please refer to the "
|
||||
"**[Deployment](#deployment)** chapter."
|
||||
msgstr ""
|
||||
"**注意:** `max-model-len` 和 `max-num-seqs` 需要根据实际使用场景进行设置。其他设置请参考 **[部署](#deployment)** 章节。"
|
||||
"**注意:** `max-model-len` 和 `max-num-seqs` 需要根据实际使用场景进行设置。其他设置请参考 "
|
||||
"**[部署](#deployment)** 章节。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:821
|
||||
#: ../../source/tutorials/models/GLM4.x.md:816
|
||||
msgid "FAQ"
|
||||
msgstr "常见问题"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:823
|
||||
#: ../../source/tutorials/models/GLM4.x.md:818
|
||||
msgid "**Q: Why is the TPOT performance poor in Long-context test?**"
|
||||
msgstr "**问:为什么在长上下文测试中TPOT性能不佳?**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:825
|
||||
#: ../../source/tutorials/models/GLM4.x.md:820
|
||||
msgid ""
|
||||
"A: Please ensure that the FIA operator replacement script has been "
|
||||
"executed successfully to complete the replacement of FIA operators. Here "
|
||||
@@ -501,28 +535,28 @@ msgid ""
|
||||
"[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) and"
|
||||
" [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"
|
||||
msgstr ""
|
||||
"答:请确保已成功执行FIA算子替换脚本以完成FIA算子的替换。脚本如下:"
|
||||
"[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh) 和 "
|
||||
"[A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"
|
||||
"答:请确保已成功执行FIA算子替换脚本以完成FIA算子的替换。脚本如下:[A2](../../../../tools/install_flash_infer_attention_score_ops_a2.sh)"
|
||||
" 和 [A3](../../../../tools/install_flash_infer_attention_score_ops_a3.sh)"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:827
|
||||
#: ../../source/tutorials/models/GLM4.x.md:822
|
||||
msgid ""
|
||||
"**Q: Startup fails with HCCL port conflicts (address already bound). What"
|
||||
" should I do?**"
|
||||
msgstr "**问:启动失败,提示HCCL端口冲突(地址已被占用)。我该怎么办?**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:829
|
||||
#: ../../source/tutorials/models/GLM4.x.md:824
|
||||
msgid "A: Clean up old processes and restart: `pkill -f VLLM*`."
|
||||
msgstr "答:清理旧进程并重启:`pkill -f VLLM*`。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:831
|
||||
#: ../../source/tutorials/models/GLM4.x.md:826
|
||||
msgid "**Q: How to handle OOM or unstable startup?**"
|
||||
msgstr "**问:如何处理OOM或启动不稳定的问题?**"
|
||||
|
||||
#: ../../source/tutorials/models/GLM4.x.md:833
|
||||
#: ../../source/tutorials/models/GLM4.x.md:828
|
||||
msgid ""
|
||||
"A: Reduce `--max-num-seqs` and `--max-model-len` first. If needed, reduce"
|
||||
" concurrency and load-testing pressure (e.g., `max-concurrency` / `num-"
|
||||
"prompts`)."
|
||||
msgstr ""
|
||||
"答:首先减少 `--max-num-seqs` 和 `--max-model-len`。如有需要,降低并发度和负载测试压力(例如,`max-concurrency` / `num-prompts`)。"
|
||||
"答:首先减少 `--max-num-seqs` 和 `--max-model-len`。如有需要,降低并发度和负载测试压力(例如,`max-"
|
||||
"concurrency` / `num-prompts`)。"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -30,10 +30,11 @@ msgstr "简介"
|
||||
#: ../../source/tutorials/models/GLM5.md:5
|
||||
msgid ""
|
||||
"[GLM-5](https://huggingface.co/zai-org/GLM-5) use a Mixture-of-Experts "
|
||||
"(MoE) architecture and targeting at complex systems engineering and long-"
|
||||
"(MoE) architecture and targets at complex systems engineering and long-"
|
||||
"horizon agentic tasks."
|
||||
msgstr ""
|
||||
"[GLM-5](https://huggingface.co/zai-org/GLM-5) 采用混合专家 (Mixture-of-Experts, MoE) 架构,旨在处理复杂系统工程和长视野智能体任务。"
|
||||
"[GLM-5](https://huggingface.co/zai-org/GLM-5) 采用混合专家 (Mixture-of-Experts,"
|
||||
" MoE) 架构,旨在处理复杂系统工程和长视野智能体任务。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:7
|
||||
msgid ""
|
||||
@@ -41,7 +42,8 @@ msgid ""
|
||||
"`vllm-ascend:v0.17.0rc1` and `vllm-ascend:v0.18.0rc1` , the version of "
|
||||
"transformers need to be upgraded to 5.2.0."
|
||||
msgstr ""
|
||||
"`GLM-5` 模型首次在 `vllm-ascend:v0.17.0rc1` 版本中得到支持。在 `vllm-ascend:v0.17.0rc1` 和 `vllm-ascend:v0.18.0rc1` 版本中,需要将 transformers 的版本升级到 5.2.0。"
|
||||
"`GLM-5` 模型首次在 `vllm-ascend:v0.17.0rc1` 版本中得到支持。在 `vllm-ascend:v0.17.0rc1`"
|
||||
" 和 `vllm-ascend:v0.18.0rc1` 版本中,需要将 transformers 的版本升级到 5.2.0。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:9
|
||||
msgid ""
|
||||
@@ -49,8 +51,7 @@ msgid ""
|
||||
"including supported features, feature configuration, environment "
|
||||
"preparation, single-node and multi-node deployment, accuracy and "
|
||||
"performance evaluation."
|
||||
msgstr ""
|
||||
"本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"
|
||||
msgstr "本文档将展示该模型的主要验证步骤,包括支持的特性、特性配置、环境准备、单节点和多节点部署、精度和性能评估。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:11
|
||||
msgid "Supported Features"
|
||||
@@ -61,15 +62,13 @@ msgid ""
|
||||
"Refer to [supported "
|
||||
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
||||
" model's supported feature matrix."
|
||||
msgstr ""
|
||||
"请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
|
||||
msgstr "请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:15
|
||||
msgid ""
|
||||
"Refer to [feature guide](../../user_guide/feature_guide/index.md) to get "
|
||||
"the feature's configuration."
|
||||
msgstr ""
|
||||
"请参考[特性指南](../../user_guide/feature_guide/index.md)以获取特性的配置方法。"
|
||||
msgstr "请参考[特性指南](../../user_guide/feature_guide/index.md)以获取特性的配置方法。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:17
|
||||
msgid "Environment Preparation"
|
||||
@@ -84,35 +83,34 @@ msgid ""
|
||||
"`GLM-5`(BF16 version): [Download model "
|
||||
"weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5)."
|
||||
msgstr ""
|
||||
"`GLM-5` (BF16 版本): [下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-5)。"
|
||||
"`GLM-5` (BF16 版本): "
|
||||
"[下载模型权重](https://www.modelscope.cn/models/ZhipuAI/GLM-5)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:22
|
||||
msgid ""
|
||||
"`GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-"
|
||||
"Tech/GLM-5-w4a8)."
|
||||
msgstr ""
|
||||
"`GLM-5-w4a8`: [下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8)。"
|
||||
msgstr "`GLM-5-w4a8`: [下载模型权重](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:23
|
||||
msgid ""
|
||||
"`GLM-5-w8a8`: [Download model weight](https://www.modelscope.cn/models"
|
||||
"/Eco-Tech/GLM-5-w8a8)."
|
||||
msgstr ""
|
||||
"`GLM-5-w8a8`: [下载模型权重](https://www.modelscope.cn/models/Eco-Tech/GLM-5-w8a8)。"
|
||||
"`GLM-5-w8a8`: [下载模型权重](https://www.modelscope.cn/models/Eco-"
|
||||
"Tech/GLM-5-w8a8)。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:24
|
||||
msgid ""
|
||||
"You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to "
|
||||
"quantify the model naively."
|
||||
msgstr ""
|
||||
"您可以使用 [msmodelslim](https://gitcode.com/Ascend/msmodelslim) 对模型进行简单的量化。"
|
||||
msgstr "您可以使用 [msmodelslim](https://gitcode.com/Ascend/msmodelslim) 对模型进行简单的量化。"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:26
|
||||
msgid ""
|
||||
"It is recommended to download the model weight to the shared directory of"
|
||||
" multiple nodes, such as `/root/.cache/`"
|
||||
msgstr ""
|
||||
"建议将模型权重下载到多个节点的共享目录中,例如 `/root/.cache/`"
|
||||
msgstr "建议将模型权重下载到多个节点的共享目录中,例如 `/root/.cache/`"
|
||||
|
||||
#: ../../source/tutorials/models/GLM5.md:28
|
||||
msgid "Installation"
@@ -146,7 +144,8 @@ msgid ""
"Install `vllm-ascend` from source, refer to "
"[installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)."
msgstr ""
"从源码安装 `vllm-ascend`,请参考[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)。"
"从源码安装 `vllm-"
"ascend`,请参考[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)。"

#: ../../source/tutorials/models/GLM5.md:123
msgid ""
@@ -200,7 +199,9 @@ msgid ""
"optimize inference efficiency. It allows non-blocking task scheduling to "
"improve concurrency and throughput, especially when processing large-"
"scale models."
msgstr "`--async-scheduling` 异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,尤其是在处理大规模模型时。"
msgstr ""
"`--async-scheduling` "
"异步调度是一种用于优化推理效率的技术。它允许非阻塞的任务调度,以提高并发性和吞吐量,尤其是在处理大规模模型时。"

#: ../../source/tutorials/models/GLM5.md:254
msgid "Multi-node Deployment"
@@ -211,7 +212,9 @@ msgid ""
"If you want to deploy multi-node environment, you need to verify multi-"
"node communication according to [verify multi-node communication "
"environment](../../installation.md#verify-multi-node-communication)."
msgstr "如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
msgstr ""
"如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-"
"communication)来验证多节点通信。"

#: ../../source/tutorials/models/GLM5.md:265
msgid "`glm-5-bf16`: require at least 2 Atlas 800 A3 (64G × 16)."
@@ -240,7 +243,9 @@ msgid ""
"For bf16 weight, use this script on each node to enable [Multi Token "
"Prediction "
"(MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)."
msgstr "对于 bf16 权重,在每个节点上使用此脚本来启用[多令牌预测 (MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)。"
msgstr ""
"对于 bf16 权重,在每个节点上使用此脚本来启用[多令牌预测 "
"(MTP)](../../user_guide/feature_guide/Multi_Token_Prediction.md)。"

#: ../../source/tutorials/models/GLM5.md:526
msgid "`glm-5-w8a8`: require 2 Atlas 800 A3 (64G × 16)."
@@ -276,200 +281,221 @@ msgid ""
"deployment, `layer_sharding` is supported only on prefill/P nodes with "
"`kv_role=\"kv_producer\"`; do not enable it on decode/D nodes or "
"`kv_role=\"kv_both\"` nodes."
msgstr "为了在预填充阶段支持 200k 的上下文窗口,需要在每个预填充节点的 `--additional_config` 中添加参数 `\"layer_sharding\": [\"q_b_proj\"]`。在 PD 解耦部署中,`layer_sharding` 仅在 `kv_role=\"kv_producer\"` 的预填充/P 节点上受支持;不要在解码/D 节点或 `kv_role=\"kv_both\"` 的节点上启用它。"
msgstr ""
"为了在预填充阶段支持 200k 的上下文窗口,需要在每个预填充节点的 `--additional_config` 中添加参数 "
"`\"layer_sharding\": [\"q_b_proj\"]`。在 PD 解耦部署中,`layer_sharding` 仅在 "
"`kv_role=\"kv_producer\"` 的预填充/P 节点上受支持;不要在解码/D 节点或 `kv_role=\"kv_both\"`"
" 的节点上启用它。"

#: ../../source/tutorials/models/GLM5.md:747
#: ../../source/tutorials/models/GLM5.md:1233
#: ../../source/tutorials/models/GLM5.md:1231
msgid "Prefill node 0"
msgstr "预填充节点 0"

#: ../../source/tutorials/models/GLM5.md:826
#: ../../source/tutorials/models/GLM5.md:1240
#: ../../source/tutorials/models/GLM5.md:825
#: ../../source/tutorials/models/GLM5.md:1238
msgid "Prefill node 1"
msgstr "预填充节点 1"

#: ../../source/tutorials/models/GLM5.md:906
#: ../../source/tutorials/models/GLM5.md:1247
#: ../../source/tutorials/models/GLM5.md:904
#: ../../source/tutorials/models/GLM5.md:1245
msgid "Decode node 0"
msgstr "解码节点 0"

#: ../../source/tutorials/models/GLM5.md:988
#: ../../source/tutorials/models/GLM5.md:1254
#: ../../source/tutorials/models/GLM5.md:986
#: ../../source/tutorials/models/GLM5.md:1252
msgid "Decode node 1"
msgstr "解码节点 1"

#: ../../source/tutorials/models/GLM5.md:1069
#: ../../source/tutorials/models/GLM5.md:1261
#: ../../source/tutorials/models/GLM5.md:1067
#: ../../source/tutorials/models/GLM5.md:1259
msgid "Decode node 2"
msgstr "解码节点 2"

#: ../../source/tutorials/models/GLM5.md:1150
#: ../../source/tutorials/models/GLM5.md:1268
#: ../../source/tutorials/models/GLM5.md:1148
#: ../../source/tutorials/models/GLM5.md:1266
msgid "Decode node 3"
msgstr "解码节点 3"

#: ../../source/tutorials/models/GLM5.md:1231
#: ../../source/tutorials/models/GLM5.md:1229
msgid ""
"Once the preparation is done, you can start the server with the following"
" command on each node:"
msgstr "准备工作完成后,您可以在每个节点上使用以下命令启动服务器:"

#: ../../source/tutorials/models/GLM5.md:1275
#: ../../source/tutorials/models/GLM5.md:1273
msgid "Request Forwarding"
msgstr "请求转发"

#: ../../source/tutorials/models/GLM5.md:1277
#: ../../source/tutorials/models/GLM5.md:1275
msgid ""
"To set up request forwarding, run the following script on any machine. "
"You can get the proxy program in the repository's examples: "
"[load_balance_proxy_server_example.py](https://github.com/vllm-project"
"/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr "要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
msgstr ""
"要设置请求转发,请在任何机器上运行以下脚本。您可以在仓库的示例中找到代理程序:[load_balance_proxy_server_example.py](https://github.com"
"/vllm-project/vllm-"
"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"

#: ../../source/tutorials/models/GLM5.md:1318
#: ../../source/tutorials/models/GLM5.md:1316
msgid "**Notice:**"
msgstr "**注意:**"

#: ../../source/tutorials/models/GLM5.md:1320
#: ../../source/tutorials/models/GLM5.md:1318
msgid "Some configurations for optimization are shown below:"
msgstr "以下是一些用于优化的配置:"

#: ../../source/tutorials/models/GLM5.md:1322
#: ../../source/tutorials/models/GLM5.md:1320
msgid ""
"`VLLM_ASCEND_ENABLE_FLASHCOMM1`: Enable FlashComm optimization to reduce "
"communication and computation overhead on prefill node. With FlashComm "
"enabled, layer_sharding list cannot include o_proj as an element."
msgstr "`VLLM_ASCEND_ENABLE_FLASHCOMM1`: 启用 FlashComm 优化以减少预填充节点上的通信和计算开销。启用 FlashComm 后,layer_sharding 列表不能包含 o_proj 作为元素。"
msgstr ""
"`VLLM_ASCEND_ENABLE_FLASHCOMM1`: 启用 FlashComm 优化以减少预填充节点上的通信和计算开销。启用 "
"FlashComm 后,layer_sharding 列表不能包含 o_proj 作为元素。"

#: ../../source/tutorials/models/GLM5.md:1323
#: ../../source/tutorials/models/GLM5.md:1321
msgid ""
"`VLLM_ASCEND_ENABLE_FUSED_MC2`: Enable following fused operators: "
"dispatch_gmm_combine_decode and dispatch_ffn_combine operator."
msgstr "`VLLM_ASCEND_ENABLE_FUSED_MC2`: 启用以下融合算子:dispatch_gmm_combine_decode 和 dispatch_ffn_combine 算子。"
"dispatch_gmm_combine_decode and dispatch_ffn_combine operator. and please"
" **note** that this environment variable can only be enabled on decode "
"nodes."
msgstr ""
"`VLLM_ASCEND_ENABLE_FUSED_MC2`: 启用以下融合算子:dispatch_gmm_combine_decode 和 "
"dispatch_ffn_combine 算子。并请**注意**,此环境变量仅可在解码节点上启用。"

#: ../../source/tutorials/models/GLM5.md:1324
#: ../../source/tutorials/models/GLM5.md:1322
msgid "`VLLM_ASCEND_ENABLE_MLAPO`: Enable fused operator MlaPreprocessOperation."
msgstr "`VLLM_ASCEND_ENABLE_MLAPO`: 启用融合算子 MlaPreprocessOperation。"

#: ../../source/tutorials/models/GLM5.md:1326
#: ../../source/tutorials/models/GLM5.md:1324
msgid ""
"Please refer to the following python file for further explanation and "
"restrictions of the environment variables above: "
"[envs.py](https://github.com/vllm-project/vllm-"
"ascend/blob/main/vllm_ascend/envs.py)"
msgstr "有关上述环境变量的进一步解释和限制,请参考以下 python 文件:[envs.py](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/envs.py)"
msgstr ""
"有关上述环境变量的进一步解释和限制,请参考以下 python 文件:[envs.py](https://github.com/vllm-"
"project/vllm-ascend/blob/main/vllm_ascend/envs.py)"

#: ../../source/tutorials/models/GLM5.md:1328
#: ../../source/tutorials/models/GLM5.md:1326
msgid "Functional Verification"
msgstr "功能验证"

#: ../../source/tutorials/models/GLM5.md:1330
#: ../../source/tutorials/models/GLM5.md:1328
msgid "Once your server is started, you can query the model with input prompts:"
msgstr "服务器启动后,您可以使用输入提示词查询模型:"

#: ../../source/tutorials/models/GLM5.md:1343
#: ../../source/tutorials/models/GLM5.md:1341
msgid "Accuracy Evaluation"
msgstr "精度评估"

#: ../../source/tutorials/models/GLM5.md:1345
#: ../../source/tutorials/models/GLM5.md:1343
msgid "Here are two accuracy evaluation methods."
msgstr "以下是两种精度评估方法。"

#: ../../source/tutorials/models/GLM5.md:1347
#: ../../source/tutorials/models/GLM5.md:1359
#: ../../source/tutorials/models/GLM5.md:1345
#: ../../source/tutorials/models/GLM5.md:1357
msgid "Using AISBench"
msgstr "使用AISBench"

#: ../../source/tutorials/models/GLM5.md:1349
#: ../../source/tutorials/models/GLM5.md:1347
msgid ""
"Refer to [Using "
"AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
"details."
msgstr "详情请参考[使用AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"

#: ../../source/tutorials/models/GLM5.md:1351
#: ../../source/tutorials/models/GLM5.md:1349
msgid "After execution, you can get the result."
msgstr "执行后,您将获得结果。"

#: ../../source/tutorials/models/GLM5.md:1353
#: ../../source/tutorials/models/GLM5.md:1351
msgid "Using Language Model Evaluation Harness"
msgstr "使用Language Model Evaluation Harness"

#: ../../source/tutorials/models/GLM5.md:1355
#: ../../source/tutorials/models/GLM5.md:1353
msgid "Not tested yet."
msgstr "尚未测试。"

#: ../../source/tutorials/models/GLM5.md:1357
#: ../../source/tutorials/models/GLM5.md:1355
msgid "Performance"
msgstr "性能"

#: ../../source/tutorials/models/GLM5.md:1361
#: ../../source/tutorials/models/GLM5.md:1359
msgid ""
"Refer to [Using AISBench for performance "
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
"performance-evaluation) for details."
msgstr "详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
msgstr ""
"详情请参考[使用AISBench进行性能评估](../../developer_guide/evaluation/using_ais_bench.md"
"#execute-performance-evaluation)。"

#: ../../source/tutorials/models/GLM5.md:1363
#: ../../source/tutorials/models/GLM5.md:1361
msgid "Using vLLM Benchmark"
msgstr "使用vLLM基准测试"

#: ../../source/tutorials/models/GLM5.md:1365
#: ../../source/tutorials/models/GLM5.md:1363
msgid ""
"Refer to [vllm "
"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
"for more details."
msgstr "更多详情请参考[vllm基准测试](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"

#: ../../source/tutorials/models/GLM5.md:1367
#: ../../source/tutorials/models/GLM5.md:1365
msgid "Best Practices"
msgstr "最佳实践"

#: ../../source/tutorials/models/GLM5.md:1369
#: ../../source/tutorials/models/GLM5.md:1367
msgid ""
"In this chapter, we recommend best practices in prefill-decode "
"disaggregation scenario with 1P1D architecture using 4 Atlas 800 A3 (64G "
"× 16):"
msgstr "本章节,我们推荐在使用4台Atlas 800 A3(64G × 16)的1P1D架构下,预填充-解码分离场景的最佳实践:"

#: ../../source/tutorials/models/GLM5.md:1371
#: ../../source/tutorials/models/GLM5.md:1369
msgid ""
"Low-latency: We recommend setting `dp4 tp8` on prefill nodes and `dp4 "
"tp8` on decode nodes for low latency situation."
msgstr "低延迟场景:对于低延迟场景,我们建议在预填充节点上设置`dp4 tp8`,在解码节点上设置`dp4 tp8`。"

#: ../../source/tutorials/models/GLM5.md:1372
#: ../../source/tutorials/models/GLM5.md:1370
msgid ""
"High-throughput: `dp4 tp8` on prefill nodes and `dp8 tp4` on decode nodes"
" is recommended for high throughput situation."
msgstr "高吞吐场景:对于高吞吐场景,建议在预填充节点上设置`dp4 tp8`,在解码节点上设置`dp8 tp4`。"

#: ../../source/tutorials/models/GLM5.md:1374
#: ../../source/tutorials/models/GLM5.md:1372
msgid ""
"**Notice:** `max-model-len` and `max-num-seqs` need to be set according "
"to the actual usage scenario. For other settings, please refer to the "
"**[Deployment](#deployment)** chapter."
msgstr "**注意:** `max-model-len`和`max-num-seqs`需要根据实际使用场景进行设置。其他设置请参考**[部署](#deployment)**章节。"
msgstr ""
"**注意:** `max-model-len`和`max-num-"
"seqs`需要根据实际使用场景进行设置。其他设置请参考**[部署](#deployment)**章节。"

#: ../../source/tutorials/models/GLM5.md:1377
#: ../../source/tutorials/models/GLM5.md:1375
msgid "FAQ"
msgstr "常见问题"

#: ../../source/tutorials/models/GLM5.md:1379
#: ../../source/tutorials/models/GLM5.md:1377
msgid ""
"**Q: How to solve ValueError: Tokenizer class TokenizersBackend does not "
"exist or is not currently imported?**"
msgstr "**问:如何解决ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported?**"
msgstr ""
"**问:如何解决ValueError: Tokenizer class TokenizersBackend does not exist or "
"is not currently imported?**"

#: ../../source/tutorials/models/GLM5.md:1381
#: ../../source/tutorials/models/GLM5.md:1379
msgid "A: Please update the version of transformers to 5.2.0"
msgstr "答:请将transformers版本更新至5.2.0"

#: ../../source/tutorials/models/GLM5.md:1383
#: ../../source/tutorials/models/GLM5.md:1381
msgid "**Q: How to enable function calling for GLM-5?**"
msgstr "**问:如何为GLM-5启用函数调用功能?**"

#: ../../source/tutorials/models/GLM5.md:1385
#: ../../source/tutorials/models/GLM5.md:1383
msgid "A: Please add following configurations in vLLM startup command"
msgstr "答:请在vLLM启动命令中添加以下配置"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -35,7 +35,9 @@ msgid ""
|
||||
"resolution visual encoder with the ERNIE-4.5-0.3B language model to "
|
||||
"enable accurate element recognition."
|
||||
msgstr ""
|
||||
"PaddleOCR-VL 是一款专为文档解析设计的 SOTA 且资源高效的模型。其核心组件是 PaddleOCR-VL-0.9B,一个紧凑而强大的视觉语言模型(VLM),它集成了 NaViT 风格的动态分辨率视觉编码器和 ERNIE-4.5-0.3B 语言模型,以实现精确的元素识别。"
|
||||
"PaddleOCR-VL 是一款专为文档解析设计的 SOTA 且资源高效的模型。其核心组件是 PaddleOCR-"
|
||||
"VL-0.9B,一个紧凑而强大的视觉语言模型(VLM),它集成了 NaViT 风格的动态分辨率视觉编码器和 ERNIE-4.5-0.3B "
|
||||
"语言模型,以实现精确的元素识别。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:7
|
||||
msgid ""
|
||||
@@ -44,8 +46,7 @@ msgid ""
|
||||
"preparation, single-node deployment, and functional verification. It is "
|
||||
"designed to help users quickly complete model deployment and "
|
||||
"verification."
|
||||
msgstr ""
|
||||
"本文档提供了完整的模型部署和验证的详细工作流程,包括支持的特性、环境准备、单节点部署和功能验证。旨在帮助用户快速完成模型部署和验证。"
|
||||
msgstr "本文档提供了完整的模型部署和验证的详细工作流程,包括支持的特性、环境准备、单节点部署和功能验证。旨在帮助用户快速完成模型部署和验证。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:9
|
||||
msgid "Supported Features"
|
||||
@@ -56,8 +57,7 @@ msgid ""
|
||||
"Refer to [supported "
|
||||
"features](../../user_guide/support_matrix/supported_models.md) to get the"
|
||||
" model's supported feature matrix."
|
||||
msgstr ""
|
||||
"请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
|
||||
msgstr "请参考[支持的特性](../../user_guide/support_matrix/supported_models.md)以获取模型支持的特性矩阵。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:13
|
||||
msgid ""
|
||||
@@ -78,7 +78,8 @@ msgid ""
|
||||
"`PaddleOCR-VL-0.9B`: [PaddleOCR-"
|
||||
"VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"
|
||||
msgstr ""
|
||||
"`PaddleOCR-VL-0.9B`: [PaddleOCR-VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"
|
||||
"`PaddleOCR-VL-0.9B`: [PaddleOCR-"
|
||||
"VL-0.9B](https://www.modelscope.cn/models/PaddlePaddle/PaddleOCR-VL)"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:21
|
||||
msgid ""
|
||||
@@ -99,13 +100,15 @@ msgid ""
|
||||
"Select an image based on your machine type and start the docker image on "
|
||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||
"docker)."
|
||||
msgstr "根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-docker)。"
|
||||
msgstr ""
|
||||
"根据您的机器类型选择镜像并在节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-"
|
||||
"up-using-docker)。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:51
|
||||
msgid ""
|
||||
"The 310P device is supported from version 0.15.0rc1. You need to select "
|
||||
"the corresponding image for installation."
|
||||
msgstr "310P 设备从版本 0.15.0rc1 开始支持。您需要选择对应的镜像进行安装。"
|
||||
"The Atlas 300 inference products are supported from version 0.15.0rc1. "
|
||||
"You need to select the corresponding image for installation."
|
||||
msgstr "Atlas 300 推理产品从版本 0.15.0rc1 开始支持。您需要选择对应的镜像进行安装。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:54
|
||||
msgid "Deployment"
|
||||
@@ -122,8 +125,9 @@ msgstr "单 NPU (PaddleOCR-VL)"
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:60
|
||||
msgid ""
|
||||
"PaddleOCR-VL supports single-node single-card deployment on the 910B4 and"
|
||||
" 310P platform. Follow these steps to start the inference service:"
|
||||
msgstr "PaddleOCR-VL 支持在 910B4 和 310P 平台上进行单节点单卡部署。请按照以下步骤启动推理服务:"
|
||||
" Atlas 300 inference products platform. Follow these steps to start the "
|
||||
"inference service:"
|
||||
msgstr "PaddleOCR-VL 支持在 910B4 和 Atlas 300 推理产品平台上进行单节点单卡部署。请按照以下步骤启动推理服务:"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:62
|
||||
msgid ""
|
||||
@@ -144,18 +148,20 @@ msgid "Run the following script to start the vLLM server on single 910B4:"
|
||||
msgstr "运行以下脚本在单张 910B4 上启动 vLLM 服务器:"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md
|
||||
msgid "310P"
|
||||
msgstr "310P"
|
||||
msgid "Atlas 300 inference products"
|
||||
msgstr "Atlas 300 推理产品"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:97
|
||||
msgid "Run the following script to start the vLLM server on single 310P:"
|
||||
msgstr "运行以下脚本在单张 310P 上启动 vLLM 服务器:"
|
||||
msgid ""
|
||||
"Run the following script to start the vLLM server on single Atlas 300 "
|
||||
"inference products:"
|
||||
msgstr "运行以下脚本在单张 Atlas 300 推理产品上启动 vLLM 服务器:"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:116
|
||||
msgid ""
|
||||
"The `--max_model_len` option is added to prevent errors when generating "
|
||||
"the attention operator mask on the 310P device."
|
||||
msgstr "添加 `--max_model_len` 选项是为了防止在 310P 设备上生成注意力算子掩码时出错。"
|
||||
"the attention operator mask on the Atlas 300 inference products."
|
||||
msgstr "添加 `--max_model_len` 选项是为了防止在 Atlas 300 推理产品上生成注意力算子掩码时出错。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:121
|
||||
msgid "Multiple NPU (PaddleOCR-VL)"
|
||||
@@ -204,7 +210,9 @@ msgid ""
|
||||
"DocLayoutV2 model to fully unleash the capabilities of the PaddleOCR-VL "
|
||||
"model, making it more consistent with the examples provided by the "
|
||||
"official PaddlePaddle documentation."
|
||||
msgstr "在上面的示例中,我们演示了如何使用 vLLM 推理 PaddleOCR-VL-0.9B 模型。通常,我们还需要集成 PP-DocLayoutV2 模型,以充分发挥 PaddleOCR-VL 模型的能力,使其更符合官方 PaddlePaddle 文档提供的示例。"
|
||||
msgstr ""
|
||||
"在上面的示例中,我们演示了如何使用 vLLM 推理 PaddleOCR-VL-0.9B 模型。通常,我们还需要集成 PP-DocLayoutV2 "
|
||||
"模型,以充分发挥 PaddleOCR-VL 模型的能力,使其更符合官方 PaddlePaddle 文档提供的示例。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:205
|
||||
msgid ""
|
||||
@@ -230,11 +238,13 @@ msgstr "使用以下命令启动容器:"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:235
|
||||
msgid ""
|
||||
"Install "
|
||||
"Install "
|
||||
"[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined)"
|
||||
" and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
|
||||
" and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
|
||||
msgstr ""
|
||||
"安装 [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) 和 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
|
||||
"安装 "
|
||||
"[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined)"
|
||||
" 和 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:246
|
||||
msgid "The OpenCV component may be missing:"
|
||||
@@ -252,11 +262,14 @@ msgstr "OM 推理"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:264
|
||||
msgid ""
|
||||
"The 310P device supports only the OM model inference. For details about "
|
||||
"the process, see the guide provided in "
|
||||
"The Atlas 300 inference products support only the OM model inference. For"
|
||||
" details about the process, see the guide provided in "
|
||||
"[ModelZoo](https://gitcode.com/Ascend/ModelZoo-"
|
||||
"PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2)."
|
||||
msgstr "310P 设备仅支持 OM 模型推理。有关该过程的详细信息,请参阅 [ModelZoo](https://gitcode.com/Ascend/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2) 中提供的指南。"
|
||||
msgstr ""
|
||||
"Atlas 300 推理产品仅支持 OM 模型推理。有关该过程的详细信息,请参阅 [ModelZoo](https://gitcode.com/Ascend"
|
||||
"/ModelZoo-PyTorch/tree/master/ACL_PyTorch/built-in/ocr/PP-DocLayoutV2) "
|
||||
"中提供的指南。"
|
||||
|
||||
#: ../../source/tutorials/models/PaddleOCR-VL.md:268
|
||||
msgid ""
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -51,7 +51,8 @@ msgid ""
|
||||
"demonstration, showcasing the `Qwen3-VL-8B-Instruct` model as an example "
|
||||
"for single NPU deployment and the `Qwen2.5-VL-32B-Instruct` model as an "
|
||||
"example for multi-NPU deployment."
|
||||
msgstr "本教程使用 vLLM-Ascend `v0.11.0rc3-a3` 版本进行演示,以 `Qwen3-VL-8B-Instruct` 模型为例展示单NPU部署,以 `Qwen2.5-VL-32B-Instruct` 模型为例展示多NPU部署。"
|
||||
msgstr ""
|
||||
"本教程使用 vLLM-Ascend `v0.11.0rc3-a3` 版本进行演示,以 `Qwen3-VL-8B-Instruct` 模型为例展示单NPU部署,以 `Qwen2.5-VL-32B-Instruct` 模型为例展示多NPU部署。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:11
|
||||
msgid "Supported Features"
|
||||
@@ -86,56 +87,65 @@ msgstr "需要 1 个 Atlas 800I A2 (64G × 8) 节点或 1 个 Atlas 800 A3 (64G
|
||||
msgid ""
|
||||
"`Qwen2.5-VL-3B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"
|
||||
msgstr "`Qwen2.5-VL-3B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen2.5-VL-3B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:24
|
||||
msgid ""
|
||||
"`Qwen2.5-VL-7B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"
|
||||
msgstr "`Qwen2.5-VL-7B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen2.5-VL-7B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:25
|
||||
msgid ""
|
||||
"`Qwen2.5-VL-32B-Instruct`:[Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"
|
||||
msgstr "`Qwen2.5-VL-32B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen2.5-VL-32B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:26
|
||||
msgid ""
|
||||
"`Qwen2.5-VL-72B-Instruct`:[Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"
|
||||
msgstr "`Qwen2.5-VL-72B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen2.5-VL-72B-Instruct`:[下载模型权重](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:27
|
||||
msgid ""
|
||||
"`Qwen3-VL-2B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"
|
||||
msgstr "`Qwen3-VL-2B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen3-VL-2B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-2B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:28
|
||||
msgid ""
|
||||
"`Qwen3-VL-4B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"
|
||||
msgstr "`Qwen3-VL-4B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen3-VL-4B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-4B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:29
|
||||
msgid ""
|
||||
"`Qwen3-VL-8B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"
|
||||
msgstr "`Qwen3-VL-8B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen3-VL-8B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-8B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:30
|
||||
msgid ""
|
||||
"`Qwen3-VL-32B-Instruct`: [Download model "
|
||||
"weight](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"
|
||||
msgstr "`Qwen3-VL-32B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"
|
||||
msgstr ""
|
||||
"`Qwen3-VL-32B-Instruct`: [下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-VL-32B-Instruct)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:32
|
||||
msgid ""
|
||||
"A sample Qwen2.5-VL quantization script can be found in the modelslim "
|
||||
"code repository. [Qwen2.5-VL Quantization Script "
|
||||
"Example](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"
|
||||
msgstr "可以在 modelslim 代码仓库中找到 Qwen2.5-VL 的量化脚本示例。[Qwen2.5-VL 量化脚本示例](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"
|
||||
msgstr ""
|
||||
"可以在 modelslim 代码仓库中找到 Qwen2.5-VL 的量化脚本示例。[Qwen2.5-VL 量化脚本示例](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/multimodal_vlm/Qwen2.5-VL/README.md)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:34
|
||||
msgid ""
|
||||
@@ -172,8 +182,7 @@ msgid ""
|
||||
"memory. You can find more details "
|
||||
"[<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)."
|
||||
msgstr ""
|
||||
"`max_split_size_mb` 可防止原生分配器拆分大于此大小(以 MB 为单位)的内存块。这可以减少内存碎片,并可能使一些临界工作负载在内存耗尽前完成。您可以在"
|
||||
"[<u>此处</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)找到更多详细信息。"
|
||||
"`max_split_size_mb` 可防止原生分配器拆分大于此大小(以 MB 为单位)的内存块。这可以减少内存碎片,并可能使一些临界工作负载在内存耗尽前完成。您可以在[<u>此处</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html)找到更多详细信息。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:115
|
||||
msgid "Deployment"
|
||||
@@ -217,10 +226,10 @@ msgid ""
|
||||
"Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-"
|
||||
"Instruct model's max seq len (256000) is larger than the maximum number "
|
||||
"of tokens that can be stored in KV cache. This will differ with different"
|
||||
" NPU series based on the HBM size. Please modify the value according to a"
|
||||
" suitable value for your NPU series."
|
||||
" NPU series based on the on-chip memory size. Please modify the value "
|
||||
"according to a suitable value for your NPU series."
|
||||
msgstr ""
|
||||
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen3-VL-8B-Instruct 模型的最大序列长度(256000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的 HBM 大小而异。请根据您 NPU 系列的合适值修改此值。"
|
||||
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen3-VL-8B-Instruct 模型的最大序列长度(256000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的片上内存大小而异。请根据您 NPU 系列的合适值修改此值。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:335
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:422
|
||||
@@ -253,10 +262,10 @@ msgid ""
|
||||
"Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-"
|
||||
"Instruct model's max_model_len (128000) is larger than the maximum number"
|
||||
" of tokens that can be stored in KV cache. This will differ with "
|
||||
"different NPU series base on the HBM size. Please modify the value "
|
||||
"according to a suitable value for your NPU series."
|
||||
"different NPU series base on the on-chip memory size. Please modify the "
|
||||
"value according to a suitable value for your NPU series."
|
||||
msgstr ""
|
||||
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen2.5-VL-32B-Instruct 模型的最大模型长度(128000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的 HBM 大小而异。请根据您 NPU 系列的合适值修改此值。"
|
||||
"添加 `--max_model_len` 选项以避免 ValueError,该错误提示 Qwen2.5-VL-32B-Instruct 模型的最大模型长度(128000)大于 KV 缓存可存储的最大令牌数。此值因不同 NPU 系列的片上内存大小而异。请根据您 NPU 系列的合适值修改此值。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:468
|
||||
msgid "Accuracy Evaluation"
|
||||
@@ -292,7 +301,8 @@ msgid ""
|
||||
"Refer to [Using "
|
||||
"lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more "
|
||||
"details on `lm_eval` installation."
|
||||
msgstr "有关 `lm_eval` 安装的更多详细信息,请参考[使用 lm_eval](../../developer_guide/evaluation/using_lm_eval.md)。"
|
||||
msgstr ""
|
||||
"有关 `lm_eval` 安装的更多详细信息,请参考[使用 lm_eval](../../developer_guide/evaluation/using_lm_eval.md)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:492
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:523
|
||||
@@ -315,7 +325,8 @@ msgstr "以 `mmmu_val` 数据集作为测试数据集为例,在离线模式下
|
||||
msgid ""
|
||||
"After execution, you can get the result, here is the result of `Qwen2.5"
|
||||
"-VL-32B-Instruct` in `vllm-ascend:0.11.0rc3` for reference only."
|
||||
msgstr "执行后,您将获得结果。以下是 `vllm-ascend:0.11.0rc3` 中 `Qwen2.5-VL-32B-Instruct` 的结果,仅供参考。"
|
||||
msgstr ""
|
||||
"执行后,您将获得结果。以下是 `vllm-ascend:0.11.0rc3` 中 `Qwen2.5-VL-32B-Instruct` 的结果,仅供参考。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen-VL-Dense.md:543
|
||||
msgid "Performance"
|
||||
@@ -357,4 +368,4 @@ msgstr "性能评估必须在在线模式下进行。以 `serve` 为例。按如
|
||||
msgid ""
|
||||
"After about several minutes, you can get the performance evaluation "
|
||||
"result."
|
||||
msgstr "大约几分钟后,您将获得性能评估结果。"
|
||||
msgstr "大约几分钟后,您将获得性能评估结果。"
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: vllm-ascend \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -35,7 +35,8 @@ msgid ""
|
||||
"advancements in reasoning, instruction-following, agent capabilities, and"
|
||||
" multilingual support."
|
||||
msgstr ""
|
||||
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合模型。基于广泛的训练,Qwen3 在推理、指令遵循、智能体能力和多语言支持方面实现了突破性进展。"
|
||||
"Qwen3 是 Qwen 系列最新一代的大语言模型,提供了一套完整的稠密模型和专家混合(MoE)模型。基于广泛的训练,Qwen3 "
|
||||
"在推理、指令遵循、智能体能力和多语言支持方面实现了突破性进展。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:7
|
||||
msgid ""
|
||||
@@ -80,7 +81,9 @@ msgid ""
|
||||
"1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G × 8)nodes. [Download "
|
||||
"model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"
|
||||
msgstr ""
|
||||
"`Qwen3-235B-A22B`(BF16 版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas 800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) 节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"
|
||||
"`Qwen3-235B-A22B`(BF16 版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas "
|
||||
"800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) "
|
||||
"节点。[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:22
|
||||
msgid ""
|
||||
@@ -89,7 +92,10 @@ msgid ""
|
||||
"8)nodes. [Download model weight](https://modelscope.cn/models/vllm-"
|
||||
"ascend/Qwen3-235B-A22B-W8A8)"
|
||||
msgstr ""
|
||||
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas 800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) 节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)"
|
||||
"`Qwen3-235B-A22B-w8a8`(量化版本):需要 1 个 Atlas 800 A3 (64G × 16) 节点、1 个 Atlas "
|
||||
"800 A2 (64G × 8) 节点或 2 个 Atlas 800 A2(32G × 8) "
|
||||
"节点。[下载模型权重](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-"
|
||||
"W8A8)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:24
|
||||
msgid ""
|
||||
@@ -106,7 +112,9 @@ msgid ""
|
||||
"If you want to deploy multi-node environment, you need to verify multi-"
|
||||
"node communication according to [verify multi-node communication "
|
||||
"environment](../../installation.md#verify-multi-node-communication)."
|
||||
msgstr "如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-communication)来验证多节点通信。"
|
||||
msgstr ""
|
||||
"如果您想部署多节点环境,需要根据[验证多节点通信环境](../../installation.md#verify-multi-node-"
|
||||
"communication)来验证多节点通信。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:30
|
||||
msgid "Installation"
|
||||
@@ -121,14 +129,18 @@ msgid ""
|
||||
"For example, using images `quay.io/ascend/vllm-ascend:v0.11.0rc2`(for "
|
||||
"Atlas 800 A2) and `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(for Atlas "
|
||||
"800 A3)."
|
||||
msgstr "例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.11.0rc2`(适用于 Atlas 800 A2)和 `quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(适用于 Atlas 800 A3)。"
|
||||
msgstr ""
|
||||
"例如,使用镜像 `quay.io/ascend/vllm-ascend:v0.11.0rc2`(适用于 Atlas 800 A2)和 "
|
||||
"`quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`(适用于 Atlas 800 A3)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:38
|
||||
msgid ""
|
||||
"Select an image based on your machine type and start the docker image on "
|
||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||
"docker)."
|
||||
msgstr "根据您的机器类型选择镜像并在节点上启动 Docker 容器,请参考[使用 Docker](../../installation.md#set-up-using-docker)。"
|
||||
msgstr ""
|
||||
"根据您的机器类型选择镜像并在节点上启动 Docker 容器,请参考[使用 Docker](../../installation.md#set-"
|
||||
"up-using-docker)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md
|
||||
msgid "Build from source"
|
||||
@@ -142,7 +154,9 @@ msgstr "您可以从源码构建所有组件。"
|
||||
msgid ""
|
||||
"Install `vllm-ascend`, refer to [set up using "
|
||||
"python](../../installation.md#set-up-using-python)."
|
||||
msgstr "安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-python)。"
|
||||
msgstr ""
|
||||
"安装 `vllm-ascend`,请参考[使用 Python 设置](../../installation.md#set-up-using-"
|
||||
"python)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:84
|
||||
msgid ""
|
||||
@@ -163,7 +177,10 @@ msgid ""
|
||||
"`Qwen3-235B-A22B` and `Qwen3-235B-A22B-w8a8` can both be deployed on 1 "
|
||||
"Atlas 800 A3(64G*16), 1 Atlas 800 A2(64G*8). Quantized version need to "
|
||||
"start with parameter `--quantization ascend`."
|
||||
msgstr "`Qwen3-235B-A22B` 和 `Qwen3-235B-A22B-w8a8` 都可以部署在 1 个 Atlas 800 A3(64G*16) 或 1 个 Atlas 800 A2(64G*8) 上。量化版本需要使用参数 `--quantization ascend` 启动。"
|
||||
msgstr ""
|
||||
"`Qwen3-235B-A22B` 和 `Qwen3-235B-A22B-w8a8` 都可以部署在 1 个 Atlas 800 "
|
||||
"A3(64G*16) 或 1 个 Atlas 800 A2(64G*8) 上。量化版本需要使用参数 `--quantization ascend`"
|
||||
" 启动。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:93
|
||||
msgid "Run the following script to execute online 128k inference."
|
||||
@@ -181,7 +198,10 @@ msgid ""
|
||||
"quantization weights to run long seqs (such as 128k context), it is "
|
||||
"required to use yarn rope-scaling technique."
|
||||
msgstr ""
|
||||
"[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) 原本仅支持 40960 上下文长度(max_position_embeddings)。如果您想使用它及其相关的量化权重来运行长序列(例如 128k 上下文),需要使用 yarn rope-scaling 技术。"
|
||||
"[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-"
|
||||
"long-texts) 原本仅支持 40960 "
|
||||
"上下文长度(max_position_embeddings)。如果您想使用它及其相关的量化权重来运行长序列(例如 128k 上下文),需要使用 "
|
||||
"yarn rope-scaling 技术。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:129
|
||||
#, python-brace-format
|
||||
@@ -192,7 +212,8 @@ msgid ""
|
||||
" \\`."
|
||||
msgstr ""
|
||||
"对于 `v0.12.0` 及以上版本的 vLLM,使用参数:`--hf-overrides '{\"rope_parameters\": "
|
||||
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}' \\`。"
|
||||
"{\"rope_type\":\"yarn\",\"rope_theta\":1000000,\"factor\":4,\"original_max_position_embeddings\":32768}}'"
|
||||
" \\`。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:130
|
||||
#, python-brace-format
|
||||
@@ -205,7 +226,10 @@ msgid ""
|
||||
"parameter."
|
||||
msgstr ""
|
||||
"对于 `v0.12.0` 以下版本的 vLLM,使用参数:`--rope_scaling "
|
||||
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}' \\`。如果您使用的是像 [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) 这样原本就支持长上下文的权重,则无需添加此参数。"
|
||||
"'{\"rope_type\":\"yarn\",\"factor\":4,\"original_max_position_embeddings\":32768}'"
|
||||
" \\`。如果您使用的是像 [Qwen3-235B-A22B-"
|
||||
"Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)"
|
||||
" 这样原本就支持长上下文的权重,则无需添加此参数。"
|
||||
|
||||
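The two YaRN entries above use `factor` 4 and `original_max_position_embeddings` 32768. As a minimal sketch of how those values relate to the 128k target, the snippet below builds the JSON payload for the `--hf-overrides` form and multiplies the two numbers. The helper names `yarn_overrides` and `extended_context` are hypothetical, and treating the extended window as `factor * original_max_position_embeddings` is the usual YaRN convention, assumed here rather than taken from the tutorial.

```python
import json


def yarn_overrides(factor: int, original_max_position_embeddings: int,
                   rope_theta: int = 1000000) -> str:
    """Build the JSON string for --hf-overrides (hypothetical helper)."""
    payload = {
        "rope_parameters": {
            "rope_type": "yarn",
            "rope_theta": rope_theta,
            "factor": factor,
            "original_max_position_embeddings": original_max_position_embeddings,
        }
    }
    # Compact separators keep the string free of spaces for shell quoting.
    return json.dumps(payload, separators=(",", ":"))


def extended_context(factor: int, original_max_position_embeddings: int) -> int:
    """Assumed convention: the scaled window is factor * original length."""
    return factor * original_max_position_embeddings


if __name__ == "__main__":
    print(yarn_overrides(4, 32768))
    print(extended_context(4, 32768))
```

With the values from the entries above, the product is 131072 tokens, which matches the 128k context the tutorial targets.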
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:133
|
||||
msgid "The parameters are explained as follows:"
|
||||
@@ -215,7 +239,9 @@ msgstr "参数解释如下:"
|
||||
msgid ""
|
||||
"`--data-parallel-size` 1 and `--tensor-parallel-size` 8 are common "
|
||||
"settings for data parallelism (DP) and tensor parallelism (TP) sizes."
|
||||
msgstr "`--data-parallel-size` 1 和 `--tensor-parallel-size` 8 是数据并行(DP)和张量并行(TP)大小的常见设置。"
|
||||
msgstr ""
|
||||
"`--data-parallel-size` 1 和 `--tensor-parallel-size` 8 "
|
||||
"是数据并行(DP)和张量并行(TP)大小的常见设置。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:136
|
||||
msgid ""
|
||||
@@ -233,21 +259,28 @@ msgid ""
|
||||
"testing performance, it is generally recommended that `--max-num-seqs` * "
|
||||
"`--data-parallel-size` >= the actual total concurrency."
|
||||
msgstr ""
|
||||
"`--max-num-seqs` 表示每个 DP 组允许处理的最大请求数。如果发送到服务的请求数超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= 实际总并发数。"
|
||||
"`--max-num-seqs` 表示每个 DP "
|
||||
"组允许处理的最大请求数。如果发送到服务的请求数超过此限制,超出的请求将保持在等待状态,不会被调度。请注意,在等待状态所花费的时间也会计入 TTFT"
|
||||
" 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= "
|
||||
"实际总并发数。"
|
||||
|
||||
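The `--max-num-seqs` entry above recommends `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency. A small sketch of that rule follows; `min_max_num_seqs` is a hypothetical helper name, not part of vLLM, and the concurrency figures in the usage lines are illustrative only.

```python
import math


def min_max_num_seqs(total_concurrency: int, data_parallel_size: int) -> int:
    """Smallest --max-num-seqs such that
    max_num_seqs * data_parallel_size >= total_concurrency."""
    if total_concurrency < 0 or data_parallel_size < 1:
        raise ValueError("expected total_concurrency >= 0 and data_parallel_size >= 1")
    return math.ceil(total_concurrency / data_parallel_size)


if __name__ == "__main__":
    # Illustrative numbers: 100 concurrent requests on one or four DP groups.
    print(min_max_num_seqs(100, 1))
    print(min_max_num_seqs(100, 4))
```

Any value below this bound leaves some requests in the waiting state, and that waiting time then shows up in TTFT and TPOT.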
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:138
|
||||
msgid ""
|
||||
"`--max-num-batched-tokens` represents the maximum number of tokens that "
|
||||
"the model can process in a single step. Currently, vLLM v1 scheduling "
|
||||
"enables ChunkPrefill/SplitFuse by default, which means:"
|
||||
msgstr "`--max-num-batched-tokens` 表示模型在单步中可以处理的最大 token 数。目前,vLLM v1 调度默认启用 ChunkPrefill/SplitFuse,这意味着:"
|
||||
msgstr ""
|
||||
"`--max-num-batched-tokens` 表示模型在单步中可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
|
||||
"ChunkPrefill/SplitFuse,这意味着:"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:139
|
||||
msgid ""
|
||||
"(1) If the input length of a request is greater than `--max-num-batched-"
|
||||
"tokens`, it will be divided into multiple rounds of computation according"
|
||||
" to `--max-num-batched-tokens`;"
|
||||
msgstr "(1) 如果一个请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-tokens` 被分成多轮计算;"
|
||||
msgstr ""
|
||||
"(1) 如果一个请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-"
|
||||
"tokens` 被分成多轮计算;"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:140
|
||||
msgid ""
|
||||
@@ -277,14 +310,21 @@ msgid ""
|
||||
"memory-utilization` too high may lead to OOM (Out of Memory) issues "
|
||||
"during actual inference. The default value is `0.9`."
|
||||
msgstr ""
|
||||
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache 大小。在预热阶段(在 vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens` 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可以使用的 kv_cache 就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-utilization` 设置得过高可能会导致实际推理期间出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
||||
"`--gpu-memory-utilization` 表示 vLLM 将用于实际推理的 HBM 比例。其核心功能是计算可用的 kv_cache "
|
||||
"大小。在预热阶段(在 vLLM 中称为 profile run),vLLM 会记录输入大小为 `--max-num-batched-tokens`"
|
||||
" 的推理过程中的峰值 GPU 内存使用量。然后,可用的 kv_cache 大小计算为:`--gpu-memory-utilization` * "
|
||||
"HBM 大小 - 峰值 GPU 内存使用量。因此,`--gpu-memory-utilization` 的值越大,可以使用的 kv_cache "
|
||||
"就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理期间不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
|
||||
"utilization` 设置得过高可能会导致实际推理期间出现 OOM(内存不足)问题。默认值为 `0.9`。"
|
||||
|
||||
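The `--gpu-memory-utilization` entry above gives the available kv_cache size as `--gpu-memory-utilization` * HBM size minus the peak memory recorded during the profile run. The sketch below only restates that arithmetic; `available_kv_cache_gib` is a hypothetical helper, and the 40 GiB peak in the usage line is an invented figure for illustration, not a measurement.

```python
def available_kv_cache_gib(hbm_gib: float, gpu_memory_utilization: float,
                           peak_profile_gib: float) -> float:
    """kv_cache budget = utilization * HBM size - peak usage in the profile run."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    budget = gpu_memory_utilization * hbm_gib - peak_profile_gib
    # A negative budget means the profile run alone exceeds the allowed share.
    return max(budget, 0.0)


if __name__ == "__main__":
    # 64 GiB card, default utilization 0.9, hypothetical 40 GiB peak.
    print(round(available_kv_cache_gib(64, 0.9, 40), 2))
```

Raising the utilization increases the budget linearly, which is why the entry warns that an aggressive value can lead to OOM when real inference uses more memory than the profile run did.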
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:143
|
||||
msgid ""
|
||||
"`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
|
||||
"does not support a mixed approach of ETP and EP; that is, MoE can either "
|
||||
"use pure EP or pure TP."
|
||||
msgstr "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE 可以使用纯 EP 或纯 TP。"
|
||||
msgstr ""
|
||||
"`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
"只能使用纯 EP 或纯 TP。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:144
|
||||
msgid ""
|
||||
@@ -308,7 +348,10 @@ msgid ""
|
||||
"mainly used to reduce the cost of operator dispatch. Currently, "
|
||||
"\"FULL_DECODE_ONLY\" is recommended."
|
||||
msgstr ""
|
||||
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和 \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 \"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 \"FULL_DECODE_ONLY\"。"
|
||||
"`--compilation-config` 包含与 aclgraph 图模式相关的配置。最重要的配置是 \"cudagraph_mode\" 和"
|
||||
" \"cudagraph_capture_sizes\",其含义如下:\"cudagraph_mode\":表示特定的图模式。目前支持 "
|
||||
"\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
|
||||
"\"FULL_DECODE_ONLY\"。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:148
|
||||
msgid ""
|
||||
@@ -319,14 +362,18 @@ msgid ""
|
||||
"Currently, the default setting is recommended. Only in some scenarios is "
|
||||
"it necessary to set this separately to achieve optimal performance."
|
||||
msgstr ""
|
||||
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"
|
||||
"\"cudagraph_capture_sizes\":表示不同级别的图模式。默认值为 [1, 2, 4, 8, 16, 24, 32, "
|
||||
"40,..., `--max-num-"
|
||||
"seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"
|
||||
|
||||
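The `cudagraph_capture_sizes` entry above says inputs that fall between two levels are padded up to the next level. A minimal sketch of that lookup is shown here; `padded_batch_size` is a hypothetical helper, and returning `None` for a batch larger than every captured size is a choice made for this sketch, since the entry does not describe that case.

```python
import bisect
from typing import Optional, Sequence


def padded_batch_size(batch_size: int,
                      capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits."""
    sizes = sorted(capture_sizes)
    idx = bisect.bisect_left(sizes, batch_size)
    if idx == len(sizes):
        return None  # not covered by any captured graph in this sketch
    return sizes[idx]


if __name__ == "__main__":
    levels = [1, 2, 4, 8, 16, 24, 32]
    for n in (1, 5, 8, 17, 33):
        print(n, "->", padded_batch_size(n, levels))
```

A batch of 5 requests therefore runs in the graph captured for 8, which is the padding cost that a custom capture list can trim in some workloads.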
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:149
|
||||
msgid ""
|
||||
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that Flashcomm1 "
|
||||
"optimization is enabled. Currently, this optimization is only supported "
|
||||
"for MoE in scenarios where tp_size > 1."
|
||||
msgstr "`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅在 tp_size > 1 的场景下对 MoE 支持。"
|
||||
msgstr ""
"`export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` 表示启用了 Flashcomm1 优化。目前,此优化仅支持 "
"tp_size > 1 场景下的 MoE。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:151
|
||||
msgid "Multi-node Deployment with MP (Recommended)"
|
||||
@@ -336,7 +383,9 @@ msgstr "使用 MP 进行多节点部署(推荐)"
|
||||
msgid ""
|
||||
"Assume you have Atlas 800 A3 (64G*16) nodes (or 2* A2), and want to "
|
||||
"deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes."
|
||||
msgstr "假设您有 Atlas 800 A3 (64G*16) 节点(或 2* A2),并希望跨多个节点部署 `Qwen3-VL-235B-A22B-Instruct` 模型。"
|
||||
msgstr ""
|
||||
"假设您有 Atlas 800 A3 (64G*16) 节点(或 2* A2),并希望跨多个节点部署 `Qwen3-VL-235B-A22B-"
|
||||
"Instruct` 模型。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:155
|
||||
msgid "Node 0"
|
||||
@@ -368,7 +417,9 @@ msgstr "预填充-解码分离"
|
||||
msgid ""
|
||||
"refer to [Prefill-Decode Disaggregation Mooncake Verification "
|
||||
"(Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"
|
||||
msgstr "请参阅 [Prefill-Decode 分离部署 Mooncake 验证 (Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"
|
||||
msgstr ""
|
||||
"请参阅 [Prefill-Decode 分离部署 Mooncake 验证 "
|
||||
"(Qwen)](../features/pd_disaggregation_mooncake_multi_node.md)"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:262
|
||||
msgid "Functional Verification"
|
||||
@@ -453,7 +504,10 @@ msgid ""
|
||||
"Refer to [Using AISBench for performance "
|
||||
"evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation) for details."
|
||||
msgstr "详情请参阅 [使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
|
||||
msgstr ""
|
||||
"详情请参阅 [使用 AISBench "
|
||||
"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
|
||||
"performance-evaluation)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:297
|
||||
msgid "Using vLLM Benchmark"
|
||||
@@ -542,13 +596,13 @@ msgstr "单节点 A3 (64G*16)"
|
||||
msgid "Example server scripts:"
|
||||
msgstr "服务器脚本示例:"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:368
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:597
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:367
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:595
|
||||
msgid "Benchmark scripts:"
|
||||
msgstr "基准测试脚本:"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:384
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:613
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:383
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:611
|
||||
msgid "Reference test results:"
|
||||
msgstr "参考测试结果:"
|
||||
|
||||
@@ -592,48 +646,53 @@ msgstr "48.69"
|
||||
msgid "2761.72"
|
||||
msgstr "2761.72"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:390
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:619
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:389
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:617
|
||||
msgid "Note:"
|
||||
msgstr "注意:"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:392
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:391
|
||||
msgid ""
|
||||
"Setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` enables MoE fused "
|
||||
"operators that reduce time consumption of MoE in both prefill and decode."
|
||||
" This is an experimental feature which only supports W8A8 quantization on"
|
||||
" Atlas A3 servers now. If you encounter any problems when using this "
|
||||
"feature, you can disable it by setting `export "
|
||||
"VLLM_ASCEND_ENABLE_FUSED_MC2=0` and update issues in vLLM-Ascend "
|
||||
"community."
|
||||
msgstr "设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` 可启用 MoE 融合算子,以减少预填充和解码阶段 MoE 的时间消耗。这是一个实验性功能,目前仅支持 Atlas A3 服务器上的 W8A8 量化。如果您在使用此功能时遇到任何问题,可以通过设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` 来禁用它,并在 vLLM-Ascend 社区更新问题。"
|
||||
"operators that reduce time consumption of MoE in decode. This is an "
|
||||
"experimental feature which only supports W8A8 quantization on Atlas A3 "
|
||||
"servers now. If you encounter any problems when using this feature, you "
|
||||
"can disable it by setting `export VLLM_ASCEND_ENABLE_FUSED_MC2=0` and "
|
||||
"update issues in vLLM-Ascend community. **Note** that this environment "
|
||||
"variable can only be enabled on decode nodes."
|
||||
msgstr ""
"设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=1` 可启用 MoE 融合算子,以减少解码阶段 MoE "
"的时间消耗。这是一个实验性功能,目前仅支持 Atlas A3 服务器上的 W8A8 量化。如果您在使用此功能时遇到任何问题,可以通过设置 "
"`export VLLM_ASCEND_ENABLE_FUSED_MC2=0` 来禁用它,并在 vLLM-Ascend 社区提交 issue。**注意**:此环境变量只能在解码节点上启用。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:393
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:392
|
||||
msgid ""
|
||||
"Here we disable prefix cache because of random datasets. You can enable "
|
||||
"prefix cache if requests have long common prefix."
|
||||
msgstr "由于使用随机数据集,此处我们禁用了前缀缓存。如果请求具有较长的公共前缀,您可以启用前缀缓存。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:395
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:394
|
||||
msgid "Three Node A3 -- PD disaggregation"
|
||||
msgstr "三节点 A3 -- PD 分离部署"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:397
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:396
|
||||
msgid ""
|
||||
"On three Atlas 800 A3(64G*16) server, we recommend to use one node as one"
|
||||
" prefill instance and two nodes as one decode instance. Example server "
|
||||
"scripts: Prefill Node 1"
|
||||
msgstr "在三台 Atlas 800 A3(64G*16) 服务器上,我们建议使用一个节点作为一个预填充实例,两个节点作为一个解码实例。服务器脚本示例:预填充节点 1"
|
||||
msgstr ""
|
||||
"在三台 Atlas 800 A3(64G*16) "
|
||||
"服务器上,我们建议使用一个节点作为一个预填充实例,两个节点作为一个解码实例。服务器脚本示例:预填充节点 1"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:462
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:460
|
||||
msgid "Decode Node 1"
|
||||
msgstr "解码节点 1"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:526
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:524
|
||||
msgid "Decode Node 2"
|
||||
msgstr "解码节点 2"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:591
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:589
|
||||
msgid "PD proxy:"
|
||||
msgstr "PD 代理:"
|
||||
|
||||
@@ -657,9 +716,13 @@ msgstr "52.07"
|
||||
msgid "8593.44"
|
||||
msgstr "8593.44"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:621
|
||||
#: ../../source/tutorials/models/Qwen3-235B-A22B.md:619
|
||||
msgid ""
|
||||
"We recommend to set `export VLLM_ASCEND_ENABLE_FUSED_MC2=2` on this "
|
||||
"scenario (typically EP32 for Qwen3-235B). This enables a different MoE "
|
||||
"fusion operator."
|
||||
msgstr "在此场景下(通常 Qwen3-235B 使用 EP32),我们建议设置 `export VLLM_ASCEND_ENABLE_FUSED_MC2=2`。这将启用一个不同的 MoE 融合算子。"
|
||||
"fusion operator. **Note** that this environment variable can only be "
|
||||
"enabled on decode nodes."
|
||||
msgstr ""
|
||||
"在此场景下(通常 Qwen3-235B 使用 EP32),我们建议设置 `export "
|
||||
"VLLM_ASCEND_ENABLE_FUSED_MC2=2`。这将启用一个不同的 MoE 融合算子。"
|
||||
"**注意**:此环境变量只能在解码节点上启用。"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,17 +29,15 @@ msgstr "简介"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:5
|
||||
msgid ""
|
||||
"Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation "
|
||||
"models. It processes text, images, audio, and video, and delivers real-"
|
||||
"Qwen3-Omni is a native end-to-end multilingual omni-modal foundation "
|
||||
"model. It processes text, images, audio, and video, and delivers real-"
|
||||
"time streaming responses in both text and natural speech. We introduce "
|
||||
"several architectural upgrades to improve performance and efficiency. The"
|
||||
" Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, "
|
||||
"equipped with chain-of-thought reasoning, supporting audio, video, and "
|
||||
"text input, with text output."
|
||||
" Thinking model of Qwen3-Omni-30B-A3B, which contains the thinker "
|
||||
"component, is equipped with chain-of-thought reasoning and supports "
|
||||
"audio, video, and text input, with text output."
|
||||
msgstr ""
|
||||
"Qwen3-Omni "
|
||||
"是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3"
|
||||
"-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
||||
"Qwen3-Omni 是原生端到端多语言全模态基础模型。它能处理文本、图像、音频和视频,并以文本和自然语音形式提供实时流式响应。我们引入了多项架构升级以提升性能和效率。Qwen3-Omni-30B-A3B 的 Thinking 模型包含思考器组件,具备思维链推理能力,支持音频、视频和文本输入,输出为文本。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:7
|
||||
msgid ""
|
||||
@@ -54,21 +52,19 @@ msgstr "支持的功能"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:11
|
||||
msgid ""
|
||||
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
||||
"Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) to get the "
|
||||
"model's supported feature matrix."
|
||||
msgstr ""
|
||||
"请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
||||
"请参考 [支持的功能](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) 以获取模型支持的功能矩阵。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:13
|
||||
msgid ""
|
||||
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
||||
"Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) to get the feature's "
|
||||
"configuration."
|
||||
msgstr ""
|
||||
"请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-"
|
||||
"cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
||||
"请参考 [功能指南](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) 以获取功能的配置信息。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:15
|
||||
msgid "Environment Preparation"
|
||||
@@ -83,17 +79,15 @@ msgid ""
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU Cards (64G × 2).[Download "
|
||||
"model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||
"Thinking) It is recommended to download the model weight to the shared "
|
||||
"directory of multiple nodes, such as `/root/.cache/`"
|
||||
"directory of multiple nodes, such as `/root/.cache/`"
|
||||
msgstr ""
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × "
|
||||
"2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-"
|
||||
"Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
||||
"`Qwen3-Omni-30B-A3B-Thinking` 需要 2 张 NPU 卡 (64G × 2)。[下载模型权重](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)。建议将模型权重下载到多节点的共享目录,例如 `/root/.cache/`。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:22
|
||||
msgid "Installation"
|
||||
msgstr "安装"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:24
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
|
||||
msgid "Use docker image"
|
||||
msgstr "使用 Docker 镜像"
|
||||
|
||||
@@ -109,10 +103,9 @@ msgid ""
|
||||
"your node, refer to [using docker](../../installation.md#set-up-using-"
|
||||
"docker)."
|
||||
msgstr ""
|
||||
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-"
|
||||
"up-using-docker)。"
|
||||
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考 [使用 Docker](../../installation.md#set-up-using-docker)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:32
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
|
||||
msgid "Build from source"
|
||||
msgstr "从源码构建"
|
||||
|
||||
@@ -125,8 +118,7 @@ msgid ""
|
||||
"Install `vllm-ascend`, refer to [set up using "
|
||||
"python](../../installation.md#set-up-using-python)."
|
||||
msgstr ""
|
||||
"安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-"
|
||||
"python)。"
|
||||
"安装 `vllm-ascend`,请参考 [使用 Python 设置](../../installation.md#set-up-using-python)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:71
|
||||
msgid "Please install system dependencies"
|
||||
@@ -159,8 +151,7 @@ msgid ""
|
||||
" least 1, and for 32 GB of memory, tensor-parallel-size should be at "
|
||||
"least 2."
|
||||
msgstr ""
|
||||
"运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-"
|
||||
"size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"
|
||||
"运行以下脚本在多 NPU 上启动 vLLM 服务器:对于具有 64 GB NPU 卡内存的 Atlas A2,tensor-parallel-size 应至少为 1;对于 32 GB 内存,tensor-parallel-size 应至少为 2。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:188
|
||||
msgid "Functional Verification"
|
||||
@@ -188,8 +179,7 @@ msgid ""
|
||||
"dataset, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in "
|
||||
"online mode."
|
||||
msgstr ""
|
||||
"以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-"
|
||||
"Thinking` 的精度评估。"
|
||||
"以 `gsm8k`、`omnibench`、`bbh` 数据集作为测试数据集为例,在在线模式下运行 `Qwen3-Omni-30B-A3B-Thinking` 的精度评估。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:239
|
||||
msgid ""
|
||||
@@ -197,21 +187,19 @@ msgid ""
|
||||
"evalscope(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
|
||||
"#install-evalscope-using-pip>) for `evalscope`installation."
|
||||
msgstr ""
|
||||
"关于 `evalscope` 的安装,请参考使用 evalscope "
|
||||
"(<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html"
|
||||
"#install-evalscope-using-pip>)。"
|
||||
"关于 `evalscope` 的安装,请参考使用 evalscope (<https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip>)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:240
|
||||
msgid "Run `evalscope` to execute the accuracy evaluation."
|
||||
msgstr "运行 `evalscope` 以执行精度评估。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:255
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
|
||||
msgid ""
|
||||
"After execution, you can get the result, here is the result of `Qwen3"
|
||||
"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
|
||||
msgstr ""
|
||||
"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
|
||||
"中的结果,仅供参考。"
|
||||
"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 中的结果,仅供参考。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:269
|
||||
msgid "Performance"
|
||||
@@ -228,8 +216,7 @@ msgid ""
|
||||
"benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more "
|
||||
"details."
|
||||
msgstr ""
|
||||
"以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 vllm 基准测试。更多详情请参考 [vllm"
|
||||
" 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||
"以运行 `Qwen3-Omni-30B-A3B-Thinking` 的性能评估为例。更多详情请参考 vllm 基准测试。更多详情请参考 [vllm 基准测试](https://docs.vllm.ai/en/latest/benchmarking/)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:277
|
||||
msgid "There are three `vllm bench` subcommands:"
|
||||
@@ -249,12 +236,4 @@ msgstr "`throughput`:对离线推理吞吐量进行基准测试。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:283
|
||||
msgid "Take the `serve` as an example. Run the code as follows."
|
||||
msgstr "以 `serve` 为例。按如下方式运行代码。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md:296
|
||||
msgid ""
|
||||
"After execution, you can get the result, here is the result of `Qwen3"
|
||||
"-Omni-30B-A3B-Thinking` in vllm-ascend:0.13.0rc1 for reference only."
|
||||
msgstr ""
|
||||
"执行后,您可以获得结果。以下是 `Qwen3-Omni-30B-A3B-Thinking` 在 vllm-ascend:0.13.0rc1 "
|
||||
"中的结果,仅供参考。"
|
||||
msgstr "以 `serve` 为例。按如下方式运行代码。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-15 09:41+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -118,7 +118,7 @@ msgstr ""
|
||||
msgid "Installation"
|
||||
msgstr "安装"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:34
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
|
||||
msgid "Use docker image"
|
||||
msgstr "使用 Docker 镜像"
|
||||
|
||||
@@ -140,7 +140,7 @@ msgstr ""
|
||||
"根据您的机器类型选择镜像并在节点上启动 Docker 镜像,请参考[使用 Docker](../../installation.md#set-"
|
||||
"up-using-docker)。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:76
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md
|
||||
msgid "Build from source"
|
||||
msgstr "从源码构建"
|
||||
|
||||
@@ -185,15 +185,15 @@ msgid ""
|
||||
"A3(64G*16)."
|
||||
msgstr "在 1 个 Atlas 800 A3(64G*16) 上运行以下脚本以执行在线 128k 推理。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:133
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:132
|
||||
msgid "**Notice:**"
|
||||
msgstr "**注意:**"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:135
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:134
|
||||
msgid "The parameters are explained as follows:"
|
||||
msgstr "参数解释如下:"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:137
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:136
|
||||
msgid ""
|
||||
"`--data-parallel-size` 1 and `--tensor-parallel-size` 16 are common "
|
||||
"settings for data parallelism (DP) and tensor parallelism (TP) sizes."
|
||||
@@ -201,13 +201,13 @@ msgstr ""
|
||||
"`--data-parallel-size` 1 和 `--tensor-parallel-size` 16 是数据并行 (DP) 和张量并行 "
|
||||
"(TP) 大小的常见设置。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:137
|
||||
msgid ""
|
||||
"`--max-model-len` represents the context length, which is the maximum "
|
||||
"value of the input plus output for a single request."
|
||||
msgstr "`--max-model-len` 表示上下文长度,即单个请求的输入加输出的最大值。"
|
||||
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
|
||||
#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:138
|
||||
msgid ""
|
||||
"`--max-num-seqs` indicates the maximum number of requests that each DP "
|
||||
"group is allowed to process. If the number of requests sent to the "
|
||||
@@ -222,7 +222,7 @@ msgstr ""
 " 和 TPOT 等指标。因此,在测试性能时,通常建议 `--max-num-seqs` * `--data-parallel-size` >= "
 "实际总并发数。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:139
 msgid ""
 "`--max-num-batched-tokens` represents the maximum number of tokens that "
 "the model can process in a single step. Currently, vLLM v1 scheduling "
@@ -231,7 +231,7 @@ msgstr ""
 "`--max-num-batched-tokens` 表示模型单步可以处理的最大 token 数。目前,vLLM v1 调度默认启用 "
 "ChunkPrefill/SplitFuse,这意味着:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:140
 msgid ""
 "(1) If the input length of a request is greater than `--max-num-batched-"
 "tokens`, it will be divided into multiple rounds of computation according"
@@ -240,20 +240,20 @@ msgstr ""
 "(1) 如果请求的输入长度大于 `--max-num-batched-tokens`,它将根据 `--max-num-batched-"
 "tokens` 被分成多轮计算;"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:141
 msgid ""
 "(2) Decode requests are prioritized for scheduling, and prefill requests "
 "are scheduled only if there is available capacity."
 msgstr "(2) 解码请求优先调度,只有在有可用容量时才调度预填充请求。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:142
 msgid ""
 "Generally, if `--max-num-batched-tokens` is set to a larger value, the "
 "overall latency will be lower, but the pressure on GPU memory (activation"
 " value usage) will be greater."
 msgstr "通常,如果 `--max-num-batched-tokens` 设置得较大,整体延迟会更低,但 GPU 内存(激活值使用)的压力会更大。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:143
 msgid ""
 "`--gpu-memory-utilization` represents the proportion of HBM that vLLM "
 "will use for actual inference. Its essential function is to calculate the"
@@ -275,7 +275,7 @@ msgstr ""
 "就越多。然而,由于预热阶段的 GPU 内存使用量可能与实际推理时不同(例如,由于 EP 负载不均),将 `--gpu-memory-"
 "utilization` 设置得过高可能导致实际推理时出现 OOM(内存不足)问题。默认值为 `0.9`。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:144
 msgid ""
 "`--enable-expert-parallel` indicates that EP is enabled. Note that vLLM "
 "does not support a mixed approach of ETP and EP; that is, MoE can either "
@@ -284,7 +284,7 @@ msgstr ""
 "`--enable-expert-parallel` 表示启用了 EP。请注意,vLLM 不支持 ETP 和 EP 的混合方法;也就是说,MoE "
 "要么使用纯 EP,要么使用纯 TP。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:145
 msgid ""
 "`--no-enable-prefix-caching` indicates that prefix caching is disabled. "
 "To enable it, for mamba-like models Qwen3.5, set `--enable-prefix-"
@@ -298,13 +298,13 @@ msgstr ""
 "的实现可能在调度时导致非常大的 block_size。例如,block_size 可能被调整为 2048,这意味着任何短于 2048 "
 "的前缀将永远不会被缓存。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:146
 msgid ""
 "`--quantization` \"ascend\" indicates that quantization is used. To "
 "disable quantization, remove this option."
 msgstr "`--quantization` \"ascend\" 表示使用了量化。要禁用量化,请移除此选项。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:148
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:147
 msgid ""
 "`--compilation-config` contains configurations related to the aclgraph "
 "graph mode. The most significant configurations are \"cudagraph_mode\" "
@@ -319,7 +319,7 @@ msgstr ""
 "\"PIECEWISE\" 和 \"FULL_DECODE_ONLY\"。图模式主要用于降低算子调度的开销。目前推荐使用 "
 "\"FULL_DECODE_ONLY\"。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:150
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:149
 msgid ""
 "\"cudagraph_capture_sizes\": represents different levels of graph modes. "
 "The default value is [1, 2, 4, 8, 16, 24, 32, 40,..., `--max-num-seqs`]. "
@@ -332,123 +332,132 @@ msgstr ""
 "40,..., `--max-num-"
 "seqs`]。在图模式下,不同级别图的输入是固定的,级别之间的输入会自动填充到下一个级别。目前推荐使用默认设置。只有在某些场景下,才需要单独设置此参数以达到最佳性能。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:152
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:151
 msgid "Multi-node Deployment with MP (Recommended)"
 msgstr "使用 MP 的多节点部署(推荐)"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:154
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:153
 msgid ""
 "Assume you have 2 Atlas 800 A2 nodes, and want to deploy the `Qwen3.5"
 "-397B-A17B-w8a8-mtp` model across multiple nodes."
 msgstr "假设您有 2 个 Atlas 800 A2 节点,并希望跨多个节点部署 `Qwen3.5-397B-A17B-w8a8-mtp` 模型。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:156
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:155
 msgid "Node 0"
 msgstr "节点 0"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:202
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:201
 msgid "Node1"
 msgstr "节点 1"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:252
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:251
 msgid ""
 "If the service starts successfully, the following information will be "
 "displayed on node 0:"
 msgstr "如果服务启动成功,节点 0 上将显示以下信息:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:263
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:262
 msgid "Multi-node Deployment with Ray"
 msgstr "使用 Ray 的多节点部署"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:265
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:264
 msgid "refer to [Ray Distributed (Qwen/Qwen3-235B-A22B)](../features/ray.md)."
 msgstr "请参考 [Ray 分布式 (Qwen/Qwen3-235B-A22B)](../features/ray.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:267
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:266
 msgid "Prefill-Decode Disaggregation"
 msgstr "预填充-解码解耦"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:269
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:268
 msgid ""
 "We recommend using Mooncake for deployment: "
 "[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)."
-msgstr "我们推荐使用 Mooncake 进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"
+msgstr ""
+"我们推荐使用 Mooncake "
+"进行部署:[Mooncake](../features/pd_disaggregation_mooncake_multi_node.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:271
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:270
 msgid ""
 "Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 1P1D (3 "
 "nodes) to run Qwen3.5-397B-A17B."
 msgstr "以 Atlas 800 A3 (64G × 16) 为例,我们建议部署 1P1D(3 个节点)来运行 Qwen3.5-397B-A17B。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:273
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:272
 msgid "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` require 3 Atlas 800 A3 (64G × 16)."
 msgstr "`Qwen3.5-397B-A17B-w8a8-mtp 1P1D` 需要 3 个 Atlas 800 A3 (64G × 16)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:275
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:274
 msgid ""
 "To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need "
 "to deploy `run_p.sh` 、`run_d0.sh` and `run_d1.sh` script on each node and"
 " deploy a `proxy.sh` script on prefill master node to forward requests."
-msgstr "要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 `run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"
+msgstr ""
+"要运行 vllm-ascend `Prefill-Decode Disaggregation` 服务,您需要在每个节点上部署 "
+"`run_p.sh`、`run_d0.sh` 和 `run_d1.sh` 脚本,并在预填充主节点上部署一个 `proxy.sh` 脚本来转发请求。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:277
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:276
 msgid "Prefill Node 0 `run_p.sh` script"
 msgstr "预填充节点 0 `run_p.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:352
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:350
 msgid "Decode Node 0 `run_d0.sh` script"
 msgstr "解码节点 0 `run_d0.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:432
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:430
 msgid "Decode Node 1 `run_d1.sh` script"
 msgstr "解码节点 1 `run_d1.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:519
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:517
 msgid "Run the `proxy.sh` script on the prefill master node"
 msgstr "在预填充主节点上运行 `proxy.sh` 脚本"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:521
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:519
 msgid ""
 "Run a proxy server on the same node with the prefiller service instance. "
 "You can get the proxy program in the repository's examples: "
 "[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-"
 "project/vllm-"
 "ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
-msgstr "在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"
+msgstr ""
+"在与预填充服务实例相同的节点上运行一个代理服务器。您可以在仓库的示例中找到代理程序:[load\\_balance\\_proxy\\_server\\_example.py](https://github.com"
+"/vllm-project/vllm-"
+"ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:547
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:545
 msgid "Functional Verification"
 msgstr "功能验证"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:549
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:547
 msgid "Once your server is started, you can query the model with input prompts:"
 msgstr "服务器启动后,您可以使用输入提示词查询模型:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:562
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:560
 msgid "Accuracy Evaluation"
 msgstr "精度评估"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:564
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:562
 msgid "Here are two accuracy evaluation methods."
 msgstr "以下是两种精度评估方法。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:566
-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:578
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:564
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:576
 msgid "Using AISBench"
 msgstr "使用 AISBench"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:568
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:566
 msgid ""
 "Refer to [Using "
 "AISBench](../../developer_guide/evaluation/using_ais_bench.md) for "
 "details."
 msgstr "详情请参阅[使用 AISBench](../../developer_guide/evaluation/using_ais_bench.md)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:570
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:568
 msgid ""
 "After execution, you can get the result, here is the result of `Qwen3.5"
 "-397B-A17B-w8a8` in `vllm-ascend:v0.17.0rc1` for reference only."
-msgstr "执行后,您可以获得结果,以下是 `vllm-ascend:v0.17.0rc1` 中 `Qwen3.5-397B-A17B-w8a8` 的结果,仅供参考。"
+msgstr ""
+"执行后,您可以获得结果,以下是 `vllm-ascend:v0.17.0rc1` 中 `Qwen3.5-397B-A17B-w8a8` "
+"的结果,仅供参考。"

 #: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:76
 msgid "dataset"
@@ -490,54 +499,74 @@ msgstr "生成"
 msgid "96.74"
 msgstr "96.74"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:576
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:574
 msgid "Performance"
 msgstr "性能"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:580
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:578
 msgid ""
 "Refer to [Using AISBench for performance "
 "evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-"
 "performance-evaluation) for details."
-msgstr "详情请参阅[使用 AISBench 进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation)。"
+msgstr ""
+"详情请参阅[使用 AISBench "
+"进行性能评估](../../developer_guide/evaluation/using_ais_bench.md#execute-"
+"performance-evaluation)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:582
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:580
 msgid "Using vLLM Benchmark"
 msgstr "使用 vLLM Benchmark"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:584
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:582
 msgid "Run performance evaluation of `Qwen3.5-397B-A17B-w8a8` as an example."
 msgstr "以运行 `Qwen3.5-397B-A17B-w8a8` 的性能评估为例。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:586
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:584
 msgid ""
 "Refer to [vllm "
 "benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) "
 "for more details."
-msgstr "更多详情请参阅 [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"
+msgstr ""
+"更多详情请参阅 [vllm "
+"benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html)。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:588
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:586
 msgid "There are three `vllm bench` subcommands:"
 msgstr "`vllm bench` 有三个子命令:"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:590
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:588
 msgid "`latency`: Benchmark the latency of a single batch of requests."
 msgstr "`latency`:对单批请求的延迟进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:591
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:589
 msgid "`serve`: Benchmark the online serving throughput."
 msgstr "`serve`:对在线服务吞吐量进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:590
 msgid "`throughput`: Benchmark offline inference throughput."
 msgstr "`throughput`:对离线推理吞吐量进行基准测试。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:594
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:592
 msgid "Take the `serve` as an example. Run the code as follows."
 msgstr "以 `serve` 为例。运行代码如下。"

-#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:601
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:599
 msgid ""
 "After about several minutes, you can get the performance evaluation "
 "result."
 msgstr "大约几分钟后,您将获得性能评估结果。"
+
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:601
+msgid "Qwen3.5-397B-A17B Known issues"
+msgstr "Qwen3.5-397B-A17B 已知问题"
+
+#: ../../source/tutorials/models/Qwen3.5-397B-A17B.md:603
+msgid ""
+"Issue1: For single-node deployment scenario, when fused_mc2 is enabled, "
+"using multi-DP model deployment may cause garbled or empty outputs after "
+"the model triggers recomputation.When tuning performance by adjusting "
+"model parallelism, ensure that this fused operator is disabled when DP > "
+"1. For PD deployment scenario,D nodes can avoid this problem by enabling "
+"the recompute scheduler."
+msgstr ""
+"问题1:在单节点部署场景下,当启用 fused_mc2 时,使用多 DP 模型部署可能会导致模型触发重计算后输出乱码或为空。在通过调整模型并行度来调优性能时,请确保当 DP > 1 时禁用此融合算子。对于 PD 部署场景,D 节点可以通过启用重计算调度器来避免此问题。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -37,7 +37,9 @@ msgid ""
 "model with vLLM Ascend. Note that only 0.9.2rc1 and higher versions of "
 "vLLM Ascend support the model."
 msgstr ""
-"Qwen3 Embedding 模型系列是 Qwen 家族最新的专有模型,专为文本嵌入和排序任务设计。它基于 Qwen3 系列的稠密基础模型,提供了多种尺寸(0.6B、4B 和 8B)的全面文本嵌入和重排序模型。本指南描述了如何使用 vLLM Ascend 运行该模型。请注意,只有 vLLM Ascend 0.9.2rc1 及更高版本支持此模型。"
+"Qwen3 Embedding 模型系列是 Qwen 家族最新的专有模型,专为文本嵌入和排序任务设计。它基于 Qwen3 "
+"系列的稠密基础模型,提供了多种尺寸(0.6B、4B 和 8B)的全面文本嵌入和重排序模型。本指南描述了如何使用 vLLM Ascend "
+"运行该模型。请注意,只有 vLLM Ascend 0.9.2rc1 及更高版本支持此模型。"

 #: ../../source/tutorials/models/Qwen3_embedding.md:7
 msgid "Supported Features"
@@ -62,19 +64,25 @@ msgstr "模型权重"
 msgid ""
 "`Qwen3-Embedding-8B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-8B)"
-msgstr "`Qwen3-Embedding-8B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-8B)"
+msgstr ""
+"`Qwen3-Embedding-8B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3"
+"-Embedding-8B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:16
 msgid ""
 "`Qwen3-Embedding-4B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-4B)"
-msgstr "`Qwen3-Embedding-4B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-4B)"
+msgstr ""
+"`Qwen3-Embedding-4B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3"
+"-Embedding-4B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:17
 msgid ""
 "`Qwen3-Embedding-0.6B` [Download model "
 "weight](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"
-msgstr "`Qwen3-Embedding-0.6B` [下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"
+msgstr ""
+"`Qwen3-Embedding-0.6B` "
+"[下载模型权重](https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)"

 #: ../../source/tutorials/models/Qwen3_embedding.md:19
 msgid ""
@@ -96,7 +104,9 @@ msgstr "您可以使用我们的官方 docker 镜像来运行 `Qwen3-Embedding`
 msgid ""
 "Start the docker image on your node, refer to [using "
 "docker](../../installation.md#set-up-using-docker)."
-msgstr "在您的节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-docker)。"
+msgstr ""
+"在您的节点上启动 docker 镜像,请参考[使用 docker](../../installation.md#set-up-using-"
+"docker)。"

 #: ../../source/tutorials/models/Qwen3_embedding.md:27
 msgid ""
@@ -142,10 +152,12 @@ msgstr "性能"

 #: ../../source/tutorials/models/Qwen3_embedding.md:98
 msgid ""
-"Run performance of `Qwen3-Reranker-8B` as an example. Refer to [vllm "
+"Run performance of `Qwen3-Embedding-8B` as an example. Refer to [vllm "
 "benchmark](https://docs.vllm.ai/en/latest/contributing/) for more "
 "details."
-msgstr "以 `Qwen3-Reranker-8B` 的运行性能为例。更多详情请参考 [vllm 基准测试](https://docs.vllm.ai/en/latest/contributing/)。"
+msgstr ""
+"以 `Qwen3-Embedding-8B` 的运行性能为例。更多详情请参考 [vllm "
+"基准测试](https://docs.vllm.ai/en/latest/contributing/)。"

 #: ../../source/tutorials/models/Qwen3_embedding.md:101
 msgid "Take the `serve` as an example. Run the code as follows."
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend\n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -119,6 +119,10 @@ msgstr "昇腾编译的配置选项"
 msgid "`eplb_config`"
 msgstr "`eplb_config`"

+#: ../../source/user_guide/configuration/additional_config.md
+msgid "Configuration options for eplb"
+msgstr "eplb 的配置选项"
+
 #: ../../source/user_guide/configuration/additional_config.md
 msgid "`refresh`"
 msgstr "`refresh`"
@@ -203,8 +207,12 @@ msgid "`recompute_scheduler_enable`"
 msgstr "`recompute_scheduler_enable`"

 #: ../../source/user_guide/configuration/additional_config.md
-msgid "Whether to enable recompute scheduler."
-msgstr "是否启用重计算调度器。"
+msgid ""
+"Whether to enable the recompute scheduler. **Only valid in PD-"
+"disaggregated mode** (`kv_role` is `kv_producer` or `kv_consumer`). **Do "
+"not enable in PD-mixed mode** (no `kv_transfer_config`, or `kv_role` is "
+"`kv_both`); startup will fail with a clear error."
+msgstr "是否启用重计算调度器。**仅在 PD 解耦模式下有效**(`kv_role` 为 `kv_producer` 或 `kv_consumer`)。**请勿在 PD 混合模式下启用**(无 `kv_transfer_config`,或 `kv_role` 为 `kv_both`);启动时将失败并显示明确的错误信息。"

 #: ../../source/user_guide/configuration/additional_config.md
 msgid "`enable_cpu_binding`"
@@ -347,7 +355,9 @@ msgstr "`prefetch_ratio`"
 msgid ""
 "`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, "
 "\"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"
-msgstr "`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, \"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"
+msgstr ""
+"`{\"attn\": {\"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}, "
+"\"mlp\": { \"gate_up\": 1.0, \"down\": 1.0}}`"

 #: ../../source/user_guide/configuration/additional_config.md
 msgid "Prefetch ratio of each weight."
@@ -519,3 +529,6 @@ msgstr "示例"
 #: ../../source/user_guide/configuration/additional_config.md:99
 msgid "An example of additional configuration is as follows:"
 msgstr "以下是额外配置的一个示例:"
+
+#~ msgid "Whether to enable recompute scheduler."
+#~ msgstr "是否启用重计算调度器。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-15 09:41+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -230,12 +230,12 @@ msgstr "实验结果"
 msgid ""
 "To evaluate the effectiveness of fine-grained TP in large-scale service "
 "scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD separated "
-"decode instances in an environment of 32 cards Ascend 910B*64G (A2), with"
-" parallel configuration as DP32+EP32, and fine-grained TP size of 8; the "
-"performance data is as follows."
+"decode instances in an environment of 32 cards Ascend Atlas A2 inference "
+"products*64G (A2), with parallel configuration as DP32+EP32, and fine-"
+"grained TP size of 8; the performance data is as follows."
 msgstr ""
 "为评估细粒度 TP 在大规模服务场景中的有效性,我们使用模型 **DeepSeek-R1-W8A8**,在 32 卡 Ascend "
-"910B*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"
+"Atlas A2 推理产品*64G (A2) 环境中部署 PD 分离的解码实例,并行配置为 DP32+EP32,细粒度 TP 规模为 8;性能数据如下。"

 #: ../../source/user_guide/feature_guide/Fine_grained_TP.md
 msgid "Module"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -29,16 +29,15 @@ msgid ""
 "active development. Track progress and planned improvements at "
 "<https://github.com/vllm-project/vllm-ascend/issues/5487>"
 msgstr ""
-"批次不变性功能目前处于测试阶段。部分功能仍在积极开发中。请通过 "
-"<https://github.com/vllm-project/vllm-ascend/issues/5487> 跟踪进展和计划改进。"
+"批次不变性功能目前处于测试阶段。部分功能仍在积极开发中。请通过 <https://github.com/vllm-project/vllm-"
+"ascend/issues/5487> 跟踪进展和计划改进。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:8
 msgid ""
 "This document shows how to enable batch invariance in vLLM-Ascend. Batch "
 "invariance ensures that the output of a model is deterministic and "
 "independent of the batch size or the order of requests in a batch."
-msgstr ""
-"本文档介绍如何在 vLLM-Ascend 中启用批次不变性。批次不变性确保模型的输出是确定性的,且不依赖于批次大小或批次中请求的顺序。"
+msgstr "本文档介绍如何在 vLLM-Ascend 中启用批次不变性。批次不变性确保模型的输出是确定性的,且不依赖于批次大小或批次中请求的顺序。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:10
 msgid "Motivation"
@@ -53,8 +52,7 @@ msgid ""
 "**Framework debugging**: Deterministic outputs make it easier to debug "
 "issues in the inference framework, as the same input will always produce "
 "the same output regardless of batching."
-msgstr ""
-"**框架调试**:确定性输出使得调试推理框架中的问题更加容易,因为无论批处理方式如何,相同的输入总是产生相同的输出。"
+msgstr "**框架调试**:确定性输出使得调试推理框架中的问题更加容易,因为无论批处理方式如何,相同的输入总是产生相同的输出。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:15
 msgid ""
@@ -81,11 +79,11 @@ msgstr "硬件要求"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:21
 msgid ""
-"Batch invariance currently requires Ascend 910B NPUs, because only the "
-"910B supports batch invariance with HCCL communication for now. We will "
-"support other NPUs in the future."
-msgstr ""
-"批次不变性目前需要 Ascend 910B NPU,因为目前只有 910B 支持通过 HCCL 通信实现批次不变性。我们未来将支持其他 NPU。"
+"Batch invariance currently requires Ascend Atlas A2 inference products "
+"NPUs, because only the Atlas A2 inference products supports batch "
+"invariance with HCCL communication for now. We will support other NPUs in"
+" the future."
+msgstr "批次不变性目前需要 Ascend Atlas A2 推理产品 NPU,因为目前只有 Atlas A2 推理产品支持通过 HCCL 通信实现批次不变性。我们未来将支持其他 NPU。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:24
 msgid "Software Requirements"
@@ -93,9 +91,10 @@ msgstr "软件要求"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:26
 msgid ""
-"Batch invariance requires a customed operator library for 910B. We will "
-"release the customed operator library in future versions."
-msgstr "批次不变性需要为 910B 定制的算子库。我们将在未来版本中发布该定制算子库。"
+"Batch invariance requires a custom operator library for Atlas A2 "
+"inference products. We will release the customed operator library in "
+"future versions."
+msgstr "批次不变性需要为 Atlas A2 推理产品定制的算子库。我们将在未来版本中发布该定制算子库。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:29
 msgid "Enabling Batch Invariance"
@@ -150,7 +149,9 @@ msgid ""
 "[GitHub issue tracker](https://github.com/vllm-project/vllm-"
 "ascend/issues/new/choose)."
 msgstr ""
-"其他模型也可能适用,但上述模型已明确经过验证。如果您在使用特定模型时遇到问题,请在 [GitHub 问题跟踪器](https://github.com/vllm-project/vllm-ascend/issues/new/choose) 上报告。"
+"其他模型也可能适用,但上述模型已明确经过验证。如果您在使用特定模型时遇到问题,请在 [GitHub "
+"问题跟踪器](https://github.com/vllm-project/vllm-ascend/issues/new/choose) "
+"上报告。"

 #: ../../source/user_guide/feature_guide/batch_invariance.md:114
 msgid "Implementation Details"
@@ -211,4 +212,6 @@ msgstr "额外的测试和验证"
 msgid ""
 "For the latest status and to contribute ideas, see the [tracking "
 "issue](https://github.com/vllm-project/vllm-ascend/issues/5487)."
-msgstr "有关最新状态和贡献想法,请参阅 [跟踪问题](https://github.com/vllm-project/vllm-ascend/issues/5487)。"
+msgstr ""
+"有关最新状态和贡献想法,请参阅 [跟踪问题](https://github.com/vllm-project/vllm-"
+"ascend/issues/5487)。"
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: vllm-ascend \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2026-04-14 09:08+0000\n"
+"POT-Creation-Date: 2026-04-22 08:13+0000\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -34,7 +34,8 @@ msgid ""
 "Parallel) and `DCP` (Decode Context Parallel), which reduces NPU memory "
 "usage and improves inference speed in long sequence LLM inference."
 msgstr ""
-"本指南介绍如何使用上下文并行(Context Parallel),一种长序列推理优化技术。上下文并行包括 `PCP`(预填充上下文并行)和 `DCP`(解码上下文并行),可减少长序列LLM推理中的NPU内存使用并提升推理速度。"
+"本指南介绍如何使用上下文并行(Context Parallel),一种长序列推理优化技术。上下文并行包括 `PCP`(预填充上下文并行)和 "
+"`DCP`(解码上下文并行),可减少长序列LLM推理中的NPU内存使用并提升推理速度。"

 #: ../../source/user_guide/feature_guide/context_parallel.md:7
 msgid "Benefits of Context Parallel"
@@ -47,32 +48,28 @@ msgid ""
 "and have quite different SLO (service level objectives), we need to "
 "implement context parallel separately for them. The major considerations "
 "are:"
-msgstr ""
-"上下文并行主要解决服务长上下文请求的问题。由于预填充和解码阶段具有截然不同的特性以及不同的服务级别目标(SLO),我们需要分别为它们实现上下文并行。主要考虑点如下:"
+msgstr "上下文并行主要解决服务长上下文请求的问题。由于预填充和解码阶段具有截然不同的特性以及不同的服务级别目标(SLO),我们需要分别为它们实现上下文并行。主要考虑点如下:"

 #: ../../source/user_guide/feature_guide/context_parallel.md:11
 msgid ""
 "For long context prefill, we can use context parallel to reduce TTFT "
 "(time to first token) by amortizing the computation time of the prefill "
 "across query tokens."
-msgstr ""
-"对于长上下文预填充,我们可以使用上下文并行,通过将预填充的计算时间分摊到查询令牌上,从而减少首令牌时间(TTFT)。"
+msgstr "对于长上下文预填充,我们可以使用上下文并行,通过将预填充的计算时间分摊到查询令牌上,从而减少首令牌时间(TTFT)。"

 #: ../../source/user_guide/feature_guide/context_parallel.md:12
 msgid ""
 "For long context decode, we can use context parallel to reduce KV cache "
 "duplication and offer more space for KV cache to increase the batch size "
 "(and hence the throughput)."
-msgstr ""
-"对于长上下文解码,我们可以使用上下文并行来减少KV缓存的重复存储,为KV缓存提供更多空间,从而增加批处理大小(进而提升吞吐量)。"
+msgstr "对于长上下文解码,我们可以使用上下文并行来减少KV缓存的重复存储,为KV缓存提供更多空间,从而增加批处理大小(进而提升吞吐量)。"

 #: ../../source/user_guide/feature_guide/context_parallel.md:14
 msgid ""
 "To learn more about the theory and implementation details of context "
 "parallel, please refer to the [context parallel developer "
 "guide](../../developer_guide/Design_Documents/context_parallel.md)."
-msgstr ""
-"要了解更多关于上下文并行的理论和实现细节,请参阅[上下文并行开发者指南](../../developer_guide/Design_Documents/context_parallel.md)。"
+msgstr "要了解更多关于上下文并行的理论和实现细节,请参阅[上下文并行开发者指南](../../developer_guide/Design_Documents/context_parallel.md)。"

 #: ../../source/user_guide/feature_guide/context_parallel.md:16
 msgid "Supported Scenarios"
@@ -132,7 +129,9 @@ msgstr "如何使用上下文并行"
 msgid ""
 "You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and "
 "`decode_context_parallel_size`, refer to the following example:"
-msgstr "您可以通过 `prefill_context_parallel_size` 和 `decode_context_parallel_size` 启用 `PCP` 和 `DCP`,请参考以下示例:"
+msgstr ""
+"您可以通过 `prefill_context_parallel_size` 和 `decode_context_parallel_size` 启用"
+" `PCP` 和 `DCP`,请参考以下示例:"

 #: ../../source/user_guide/feature_guide/context_parallel.md:29
 msgid "Offline example:"
@@ -147,7 +146,9 @@ msgid ""
|
||||
"The total world size is `tensor_parallel_size` * "
|
||||
"`prefill_context_parallel_size`, so the examples above need 4 NPUs for "
|
||||
"each."
|
||||
msgstr "总的世界大小为 `tensor_parallel_size` * `prefill_context_parallel_size`,因此上述示例各需要4个NPU。"
|
||||
msgstr ""
|
||||
"总的世界大小为 `tensor_parallel_size` * "
|
||||
"`prefill_context_parallel_size`,因此上述示例各需要4个NPU。"
|
||||
|
||||
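As a rough illustration of the sizing rule in the entry above, the following sketch multiplies the two parallel sizes to get the number of NPUs one instance needs. The helper name `required_npus` and the 2 x 2 split are hypothetical and are not taken from the guide's own examples.

```python
def required_npus(tensor_parallel_size: int,
                  prefill_context_parallel_size: int) -> int:
    """Return the world size: TP size multiplied by PCP size."""
    if tensor_parallel_size < 1 or prefill_context_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * prefill_context_parallel_size


# Hypothetical split chosen only to illustrate the arithmetic.
print(required_npus(tensor_parallel_size=2, prefill_context_parallel_size=2))
```

Any pair of sizes whose product is 4 fits the "4 NPUs" case mentioned in the entry.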
#: ../../source/user_guide/feature_guide/context_parallel.md:59
|
||||
msgid "Constraints"
|
||||
@@ -178,14 +179,18 @@ msgstr "对于基于GQA的模型,例如Qwen3-235B:"
|
||||
msgid ""
|
||||
"`(tensor_parallel_size // num_key_value_heads) >= "
|
||||
"decode_context_parallel_size`"
|
||||
msgstr "`(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size`"
|
||||
msgstr ""
|
||||
"`(tensor_parallel_size // num_key_value_heads) >= "
|
||||
"decode_context_parallel_size`"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/context_parallel.md:67
|
||||
#, python-format
|
||||
msgid ""
|
||||
"`(tensor_parallel_size // num_key_value_heads) % "
|
||||
"decode_context_parallel_size == 0`"
|
||||
msgstr "`(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0`"
|
||||
msgstr ""
|
||||
"`(tensor_parallel_size // num_key_value_heads) % "
|
||||
"decode_context_parallel_size == 0`"
|
||||
|
||||
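The two GQA constraints quoted above can be checked before launch with a few lines of Python. This is a minimal sketch that transcribes the two expressions literally; the function name `dcp_size_is_valid` and the sample values are assumptions made for illustration.

```python
def dcp_size_is_valid(tensor_parallel_size: int,
                      num_key_value_heads: int,
                      decode_context_parallel_size: int) -> bool:
    """Evaluate both constraints exactly as written in the guide."""
    if min(tensor_parallel_size, num_key_value_heads,
           decode_context_parallel_size) < 1:
        raise ValueError("all sizes must be positive integers")
    ratio = tensor_parallel_size // num_key_value_heads
    return (ratio >= decode_context_parallel_size
            and ratio % decode_context_parallel_size == 0)


# Hypothetical values: TP=16 with 4 KV heads gives a ratio of 4.
print(dcp_size_is_valid(16, 4, 2))  # ratio 4 is divisible by 2
print(dcp_size_is_valid(16, 4, 3))  # ratio 4 is not divisible by 3
```

Running the check up front surfaces an invalid `decode_context_parallel_size` before any device is allocated.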
#: ../../source/user_guide/feature_guide/context_parallel.md:69
|
||||
msgid ""
|
||||
@@ -195,7 +200,9 @@ msgid ""
|
||||
"`block_size`(default: 128), which specifies CP to split KV cache in a "
|
||||
"block-interleave style. For example:"
|
||||
msgstr ""
|
||||
"在需要KV缓存传输的场景(例如KV池化、PD解耦)中使用上下文并行时,为简化KV缓存传输,必须将 `cp_kv_cache_interleave_size` 设置为与KV缓存 `block_size`(默认:128)相同的值,这指定了CP以块交错方式分割KV缓存。例如:"
|
||||
"在需要KV缓存传输的场景(例如KV池化、PD解耦)中使用上下文并行时,为简化KV缓存传输,必须将 "
|
||||
"`cp_kv_cache_interleave_size` 设置为与KV缓存 "
|
||||
"`block_size`(默认:128)相同的值,这指定了CP以块交错方式分割KV缓存。例如:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/context_parallel.md:80
|
||||
msgid "Experimental Results"
|
||||
@@ -206,9 +213,11 @@ msgid ""
|
||||
"To evaluate the effectiveness of Context Parallel in long sequence LLM "
|
||||
"inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, "
|
||||
"deploy PD disaggregate instances in the environment of 64 cards Ascend "
|
||||
"910C*64G (A3), the configuration and performance data are as follows."
|
||||
"Atlas A3 inference products*64G (A3), the configuration and performance "
|
||||
"data are as follows."
|
||||
msgstr ""
|
||||
"为评估上下文并行在长序列LLM推理场景中的有效性,我们使用 **DeepSeek-R1-W8A8** 和 **Qwen3-235B**,在64卡Ascend 910C*64G(A3)环境中部署PD解耦实例,配置和性能数据如下。"
|
||||
"为评估上下文并行在长序列LLM推理场景中的有效性,我们使用 **DeepSeek-R1-W8A8** 和 "
|
||||
"**Qwen3-235B**,在64卡Ascend Atlas A3推理产品*64G(A3)环境中部署PD解耦实例,配置和性能数据如下。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/context_parallel.md:84
|
||||
msgid "DeepSeek-R1-W8A8:"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -36,7 +36,9 @@ msgid ""
|
||||
"latency. This feature only adjusts host-side CPU affinity policies and "
|
||||
"**does not alter model execution logic or impact inference results**."
|
||||
msgstr ""
|
||||
"CPU 绑定是 vLLM 的一项性能优化功能,专为配备 **ARM 架构和昇腾 NPU** 的服务器设计。它将 vLLM 进程和线程固定到特定的 CPU 核心,以减少 CPU-NPU 跨 NUMA 通信开销并稳定推理延迟。此功能仅调整主机端的 CPU 亲和性策略,**不会改变模型执行逻辑或影响推理结果**。"
|
||||
"CPU 绑定是 vLLM 的一项性能优化功能,专为配备 **ARM 架构和昇腾 NPU** 的服务器设计。它将 vLLM 进程和线程固定到特定的 "
|
||||
"CPU 核心,以减少 CPU-NPU 跨 NUMA 通信开销并稳定推理延迟。此功能仅调整主机端的 CPU "
|
||||
"亲和性策略,**不会改变模型执行逻辑或影响推理结果**。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/cpu_binding.md:7
|
||||
msgid "Usage"
|
||||
@@ -84,12 +86,13 @@ msgstr "IRQ 绑定的额外注意事项"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/cpu_binding.md:70
|
||||
msgid ""
|
||||
"For best results, if you run inside a docker container, which `systemctl`"
|
||||
" is likely unavailable, stop `irqbalance` service on the host manually "
|
||||
"before starting vLLM. Also make sure the container has the necessary "
|
||||
"For best results, if you run inside a Docker container where `systemctl` "
|
||||
"is likely unavailable, stop the `irqbalance` service on the host manually"
|
||||
" before starting vLLM. Also make sure the container has the necessary "
|
||||
"permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:"
|
||||
msgstr ""
|
||||
"为获得最佳效果,如果您在 Docker 容器内运行(容器内可能没有 `systemctl`),请在启动 vLLM 前手动在主机上停止 `irqbalance` 服务。同时确保容器具有写入 `/proc/irq/*/smp_affinity` 以进行 IRQ 绑定所需的权限:"
|
||||
"为获得最佳效果,如果您在 Docker 容器内运行(容器内可能没有 `systemctl`),请在启动 vLLM 前手动在主机上停止 "
|
||||
"`irqbalance` 服务。同时确保容器具有写入 `/proc/irq/*/smp_affinity` 以进行 IRQ 绑定所需的权限:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/cpu_binding.md:72
|
||||
msgid "**Stop `irqbalance` service**:"
|
||||
@@ -192,7 +195,9 @@ msgid ""
|
||||
"1. Confirm that required tools (taskset, lscpu, npu-smi) are installed "
|
||||
"and available; 2. Verify the Cpus_allowed_list in `/proc/self/status` is "
|
||||
"valid."
|
||||
msgstr "1. 确认所需工具(taskset, lscpu, npu-smi)已安装且可用;2. 验证 `/proc/self/status` 中的 Cpus_allowed_list 是有效的。"
|
||||
msgstr ""
|
||||
"1. 确认所需工具(taskset, lscpu, npu-smi)已安装且可用;2. 验证 `/proc/self/status` 中的 "
|
||||
"Cpus_allowed_list 是有效的。"
|
||||
|
||||
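The second troubleshooting step above can be done by hand with a short script. The sketch below parses the `Cpus_allowed_list` field from the text of `/proc/self/status`. The guide does not define "valid"; treating a list that parses to at least one CPU id as valid is an assumption of this sketch, and the helper names are hypothetical.

```python
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as '0-3,8-11' into CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_allowed_from_status(status_text: str) -> List[int]:
    """Find the Cpus_allowed_list line and return the CPU ids it names."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    raise ValueError("Cpus_allowed_list not found")


# Sample text; on Linux, pass open("/proc/self/status").read() instead.
sample = "Name:\tpython\nCpus_allowed_list:\t0-3,8-11\n"
print(cpus_allowed_from_status(sample))
```

An empty result or a `ValueError` would suggest that the affinity mask is not usable for binding.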
#: ../../source/user_guide/feature_guide/cpu_binding.md:98
|
||||
msgid "Key Limitations"
|
||||
@@ -268,13 +273,17 @@ msgid ""
|
||||
"processes and threads to specific CPU cores, thereby stabilizing "
|
||||
"inference latency in Ascend NPU deployments (only applicable to ARM "
|
||||
"architectures)."
|
||||
msgstr "**核心目标**:通过将 vLLM 进程和线程固定到特定的 CPU 核心来减少跨 NUMA 通信,从而稳定昇腾 NPU 部署中的推理延迟(仅适用于 ARM 架构)。"
|
||||
msgstr ""
|
||||
"**核心目标**:通过将 vLLM 进程和线程固定到特定的 CPU 核心来减少跨 NUMA 通信,从而稳定昇腾 NPU 部署中的推理延迟(仅适用于"
|
||||
" ARM 架构)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/cpu_binding.md:130
|
||||
msgid ""
|
||||
"**Usage**: Enable or disable with `enable_cpu_binding` via "
|
||||
"`additional_config` in both online and offline workflows."
|
||||
msgstr "**使用方法**:在在线和离线工作流中,通过 `additional_config` 中的 `enable_cpu_binding` 参数启用或禁用。"
|
||||
msgstr ""
|
||||
"**使用方法**:在在线和离线工作流中,通过 `additional_config` 中的 `enable_cpu_binding` "
|
||||
"参数启用或禁用。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/cpu_binding.md:132
|
||||
msgid ""
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,8 +29,7 @@ msgid ""
|
||||
"during each inference iteration within the chunked prefilling strategy "
|
||||
"according to the resources and SLO targets, thereby improving the "
|
||||
"effective throughput and decreasing the TBT."
|
||||
msgstr ""
|
||||
"动态批处理是一种技术,它根据资源和SLO目标,在分块预填充策略的每次推理迭代中动态调整块大小,从而提高有效吞吐量并降低TBT。"
|
||||
msgstr "动态批处理是一种技术,它根据资源和SLO目标,在分块预填充策略的每次推理迭代中动态调整块大小,从而提高有效吞吐量并降低TBT。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:5
|
||||
msgid ""
|
||||
@@ -41,7 +40,8 @@ msgid ""
|
||||
"further improvements and this feature will support more XPUs in the "
|
||||
"future."
|
||||
msgstr ""
|
||||
"动态批处理由 `--SLO_limits_for_dynamic_batch` 参数的值控制。值得注意的是,目前仅支持910 B3,且解码token数量规模需低于2048。特别是在Qwen、Llama模型上,改进效果相当明显。我们正在进行进一步的改进,该功能未来将支持更多XPU。"
|
||||
"动态批处理由 `--SLO_limits_for_dynamic_batch` 参数的值控制。值得注意的是,目前仅支持910 "
|
||||
"B3,且解码token数量规模需低于2048。特别是在Qwen、Llama模型上,改进效果相当明显。我们正在进行进一步的改进,该功能未来将支持更多XPU。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:10
|
||||
msgid "Getting started"
|
||||
@@ -60,7 +60,10 @@ msgid ""
|
||||
"ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to "
|
||||
"the path `vllm_ascend/core/profile_table.csv`"
|
||||
msgstr ""
|
||||
"动态批处理目前依赖于一个保存在查找表中的离线成本模型来优化token预算。该查找表保存在一个'.csv'文件中,需要先从[A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv)下载,重命名后保存到路径 `vllm_ascend/core/profile_table.csv`。"
|
||||
"动态批处理目前依赖于一个保存在查找表中的离线成本模型来优化token预算。该查找表保存在一个'.csv'文件中,需要先从[A2-B3-BLK128.csv](https"
|
||||
"://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-"
|
||||
"ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv)下载,重命名后保存到路径 "
|
||||
"`vllm_ascend/core/profile_table.csv`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:16
|
||||
msgid ""
|
||||
@@ -75,12 +78,12 @@ msgstr "调优参数"
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:24
|
||||
msgid ""
|
||||
"`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) "
|
||||
"for the dynamic batch feature, larger values impose more constraints on "
|
||||
"the latency limitation, leading to higher effective throughput. The "
|
||||
"parameter can be selected according to the specific models or service "
|
||||
"requirements."
|
||||
"for the dynamic batch feature; larger values relax the latency limit, "
|
||||
"leading to higher effective throughput. The parameter can be selected "
|
||||
"according to the specific models or service requirements."
|
||||
msgstr ""
|
||||
"`--SLO_limits_for_dynamic_batch` 是动态批处理功能的调优参数(整数类型),较大的值会对延迟限制施加更多约束,从而带来更高的有效吞吐量。可以根据具体模型或服务需求选择该参数。"
|
||||
"`--SLO_limits_for_dynamic_batch` "
|
||||
"是动态批处理功能的调优参数(整数类型),较大的值会放宽延迟限制,从而带来更高的有效吞吐量。可以根据具体模型或服务需求选择该参数。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:32
|
||||
msgid "Supported Models"
|
||||
@@ -95,7 +98,9 @@ msgid ""
|
||||
"75`. Therefore, some additional tests are needed to select the best "
|
||||
"parameter."
|
||||
msgstr ""
|
||||
"目前,动态批处理在几个密集模型上表现更好,包括Qwen和Llama(从8B到32B),且 `tensor_parallel_size=8`。对于不同的模型,需要一个合适的 `SLO_limits_for_dynamic_batch` 参数。该参数的经验值通常是 `35、50或75`。因此,需要进行一些额外的测试来选择最佳参数。"
|
||||
"目前,动态批处理在几个密集模型上表现更好,包括Qwen和Llama(从8B到32B),且 "
|
||||
"`tensor_parallel_size=8`。对于不同的模型,需要一个合适的 `SLO_limits_for_dynamic_batch` "
|
||||
"参数。该参数的经验值通常是 `35、50或75`。因此,需要进行一些额外的测试来选择最佳参数。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/dynamic_batch.md:36
|
||||
msgid "Usage"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,14 +29,14 @@ msgstr "概述"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:5
|
||||
msgid ""
|
||||
"Expert balancing for MoE models in LLM serving is essential for optimal "
|
||||
"performance. Dynamically changing experts during inference can negatively"
|
||||
" impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due "
|
||||
"to stop-the-world operations. SwiftBalancer enables asynchronous expert "
|
||||
"load balancing with zero-overhead expert movement, ensuring seamless "
|
||||
"service continuity."
|
||||
"Expert balancing for MoE (Mixture of Experts) models in LLM (Large "
|
||||
"Language Model) serving is essential for optimal performance. Dynamically "
|
||||
"changing experts during inference can negatively impact TTFT (Time To "
|
||||
"First Token) and TPOT (Time Per Output Token) due to stop-the-world "
|
||||
"operations. SwiftBalancer enables asynchronous expert load balancing with"
|
||||
" zero-overhead expert movement, ensuring seamless service continuity."
|
||||
msgstr ""
|
||||
"在LLM服务中,MoE模型的专家均衡对于实现最佳性能至关重要。推理过程中动态改变专家会因全局暂停操作而对TTFT(首词元时间)和TPOT(每输出词元时间)产生负面影响。SwiftBalancer支持异步专家负载均衡,实现零开销的专家迁移,确保服务无缝连续。"
|
||||
"在LLM(大语言模型)服务中,MoE(混合专家)模型的专家均衡对于实现最佳性能至关重要。推理过程中动态改变专家会因全局暂停操作而对TTFT(首词元时间)和TPOT(每输出词元时间)产生负面影响。SwiftBalancer支持异步专家负载均衡,实现零开销的专家迁移,确保服务无缝连续。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:7
|
||||
msgid "EPLB Effects"
|
||||
@@ -107,7 +107,9 @@ msgid ""
|
||||
"Adjust expert_heat_collection_interval and algorithm_execution_interval "
|
||||
"based on workload patterns."
|
||||
msgstr ""
|
||||
"我们需要添加环境变量 `export DYNAMIC_EPLB=\"true\"` 来启用vLLM EPLB。启用具有自动调优参数的动态均衡。根据工作负载模式调整 expert_heat_collection_interval 和 algorithm_execution_interval。"
|
||||
"我们需要添加环境变量 `export DYNAMIC_EPLB=\"true\"` 来启用vLLM "
|
||||
"EPLB。启用具有自动调优参数的动态均衡。根据工作负载模式调整 expert_heat_collection_interval 和 "
|
||||
"algorithm_execution_interval。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:42
|
||||
msgid "Static EPLB"
|
||||
@@ -124,7 +126,8 @@ msgid ""
|
||||
"expert_map_record_path. This creates a baseline configuration for future "
|
||||
"deployments."
|
||||
msgstr ""
|
||||
"我们需要添加环境变量 `export EXPERT_MAP_RECORD=\"true\"` 来记录专家映射。使用 expert_map_record_path 生成初始专家分布映射。这将为未来的部署创建一个基线配置。"
|
||||
"我们需要添加环境变量 `export EXPERT_MAP_RECORD=\"true\"` 来记录专家映射。使用 "
|
||||
"expert_map_record_path 生成初始专家分布映射。这将为未来的部署创建一个基线配置。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/eplb_swift_balancer.md:60
|
||||
msgid "Subsequent Deployments (Use Recorded Map)"
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -141,10 +141,6 @@ msgstr "为保证哈希生成的一致性,启用 KV Pool 时,需要在所有
|
||||
msgid "Example of using Mooncake as a KV Pool backend"
|
||||
msgstr "使用 Mooncake 作为 KV Pool 后端的示例"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:35
|
||||
msgid "Software:"
|
||||
msgstr "软件:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:36
|
||||
msgid "Check NPU HCCN Configuration:"
|
||||
msgstr "检查 NPU HCCN 配置:"
|
||||
@@ -167,9 +163,9 @@ msgid ""
|
||||
"binaries>. First, we need to obtain the Mooncake project. Refer to the "
|
||||
"following command:"
|
||||
msgstr ""
|
||||
"Mooncake 是 Moonshot AI 提供的领先 LLM 服务 Kimi 的推理平台。 安装与编译指南:"
|
||||
"<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-"
|
||||
"binaries>。 首先,我们需要获取 Mooncake 项目。参考以下命令:"
|
||||
"Mooncake 是 Moonshot AI 提供的领先 LLM 服务 Kimi 的推理平台。 "
|
||||
"安装与编译指南:<https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-"
|
||||
"and-use-binaries>。 首先,我们需要获取 Mooncake 项目。参考以下命令:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:54
|
||||
msgid "(Optional) Replace go install url if the network is poor"
|
||||
@@ -266,7 +262,7 @@ msgid "`export HCCL_INTRA_ROCE_ENABLE=1`"
|
||||
msgstr "`export HCCL_INTRA_ROCE_ENABLE=1`"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md
|
||||
msgid "Required by direct transmission cheme on 800 I/T A2 series"
|
||||
msgid "Required by direct transmission scheme on 800 I/T A2 series"
|
||||
msgstr "800 I/T A2 系列直传方案所需"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:102
|
||||
@@ -280,8 +276,8 @@ msgid ""
|
||||
"(ascend_direct), see: "
|
||||
"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
|
||||
msgstr ""
|
||||
"关于 HIXL (ascend_direct) 的常见故障排除和问题定位指南,请参阅:"
|
||||
"<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
|
||||
"关于 HIXL (ascend_direct) "
|
||||
"的常见故障排除和问题定位指南,请参阅:<https://gitcode.com/cann/hixl/wiki/HIXL%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E5%AE%9A%E4%BD%8D%E6%89%8B%E5%86%8C.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:107
|
||||
msgid "Run Mooncake Master"
|
||||
@@ -305,9 +301,9 @@ msgid ""
|
||||
"service. **global_segment_size**: Registered memory size per card to "
|
||||
"the KV Pool. **Needs to be aligned to 1GB.**"
|
||||
msgstr ""
|
||||
"**metadata_server**: 配置为 **P2PHANDSHAKE**。 **protocol:** 在 NPU 上必须设置为 'Ascend'。"
|
||||
"**device_name**: \"\" **master_server_address**: 配置 master 服务的 IP 和端口。 "
|
||||
"**global_segment_size**: 每张卡注册到 KV Pool 的内存大小。**需要对齐到 1GB。**"
|
||||
"**metadata_server**: 配置为 **P2PHANDSHAKE**。 **protocol:** 在 NPU 上必须设置为 "
|
||||
"'Ascend'。**device_name**: \"\" **master_server_address**: 配置 master 服务的"
|
||||
" IP 和端口。 **global_segment_size**: 每张卡注册到 KV Pool 的内存大小。**需要对齐到 1GB。**"
|
||||
|
||||
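The 1GB alignment requirement on `global_segment_size` can be met by rounding the desired size up to the next boundary. The sketch below assumes the value is expressed in bytes and that "1GB" means 2^30 bytes; neither is stated in the entry above, and the helper name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: "1GB" is read as 2**30 bytes


def align_up_to_gib(size_bytes: int) -> int:
    """Round a byte count up to the next multiple of 1 GiB."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    return -(-size_bytes // GIB) * GIB  # ceiling division, then scale back


# Hypothetical request of 2.5 GiB, rounded up to 3 GiB.
print(align_up_to_gib(int(2.5 * GIB)) // GIB)
```

A value that is already aligned is returned unchanged.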
#: ../../source/user_guide/feature_guide/kv_pool.md:129
|
||||
msgid "2. Start mooncake_master"
|
||||
@@ -326,8 +322,9 @@ msgid ""
|
||||
"`--default_kv_lease_ttl` and keep it larger than `ASCEND_CONNECT_TIMEOUT`"
|
||||
" and `ASCEND_TRANSFER_TIMEOUT`."
|
||||
msgstr ""
|
||||
"`eviction_high_watermark_ratio` 决定了 Mooncake Store 执行淘汰的水位线,`eviction_ratio` 决定了将被淘汰的存储对象比例。"
|
||||
"`default_kv_lease_ttl` 控制 KV 对象的默认租约 TTL(毫秒);通过 `--default_kv_lease_ttl` 配置,并保持其大于 "
|
||||
"`eviction_high_watermark_ratio` 决定了 Mooncake Store "
|
||||
"执行淘汰的水位线,`eviction_ratio` 决定了将被淘汰的存储对象比例。`default_kv_lease_ttl` 控制 KV "
|
||||
"对象的默认租约 TTL(毫秒);通过 `--default_kv_lease_ttl` 配置,并保持其大于 "
|
||||
"`ASCEND_CONNECT_TIMEOUT` 和 `ASCEND_TRANSFER_TIMEOUT`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:140
|
||||
@@ -347,8 +344,9 @@ msgid ""
|
||||
"performs kv_transfer, while `AscendStoreConnector` serves as the prefix-"
|
||||
"cache node."
|
||||
msgstr ""
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。"
|
||||
"`MooncakeConnectorV1` 执行 kv_transfer,而 `AscendStoreConnector` 作为 prefix-cache 节点。"
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 "
|
||||
"`AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 "
|
||||
"`AscendStoreConnector` 作为 prefix-cache 节点。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:146
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:611
|
||||
@@ -379,9 +377,10 @@ msgid ""
|
||||
"AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` "
|
||||
"or `prefill_pp_layer_partition` also needs to be set. Example as follows:"
|
||||
msgstr ""
|
||||
"目前,PD 解耦中的键值池默认仅存储 Prefill 节点生成的 kv cache。在使用 MLA 的模型中,现已支持 Decode 节点存储 kv cache 供 "
|
||||
"Prefill 节点使用,通过在 AscendStoreConnector 中添加 `consumer_is_to_put: true` 来启用。如果 Prefill "
|
||||
"节点启用了 PP,则还需要设置 `prefill_pp_size` 或 `prefill_pp_layer_partition`。示例如下:"
|
||||
"目前,PD 解耦中的键值池默认仅存储 Prefill 节点生成的 kv cache。在使用 MLA 的模型中,现已支持 Decode 节点存储 "
|
||||
"kv cache 供 Prefill 节点使用,通过在 AscendStoreConnector 中添加 `consumer_is_to_put:"
|
||||
" true` 来启用。如果 Prefill 节点启用了 PP,则还需要设置 `prefill_pp_size` 或 "
|
||||
"`prefill_pp_layer_partition`。示例如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:308
|
||||
msgid "2. Start proxy_server"
|
||||
@@ -452,7 +451,10 @@ msgid ""
|
||||
"required. Establishing these connections introduces a one-time time "
|
||||
"overhead and persistent device memory consumption (4 MB of device memory "
|
||||
"per connection)."
|
||||
msgstr "这是因为当涉及设备到设备通信时,HCCL 单边通信连接是在实例启动后延迟创建的。目前,需要在所有设备之间建立全连接。建立这些连接会引入一次性时间开销和持续的设备内存消耗(每个连接消耗 4 MB 设备内存)。"
|
||||
msgstr ""
|
||||
"这是因为当涉及设备到设备通信时,HCCL "
|
||||
"单边通信连接是在实例启动后延迟创建的。目前,需要在所有设备之间建立全连接。建立这些连接会引入一次性时间开销和持续的设备内存消耗(每个连接消耗 4 "
|
||||
"MB 设备内存)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:404
|
||||
msgid ""
|
||||
@@ -473,19 +475,25 @@ msgstr "安装 Memcache"
|
||||
msgid ""
|
||||
"**MemCache depends on MemFabric. Therefore, MemFabric must be "
|
||||
"installed first; install MemCache only after MemFabric is in place.**"
|
||||
msgstr "**MemCache 依赖于 MemFabric。因此,必须先安装 MemFabric。在 memfabric 安装完成后,再安装 memcache。**"
|
||||
msgstr ""
|
||||
"**MemCache 依赖于 MemFabric。因此,必须先安装 MemFabric。在 memfabric 安装完成后,再安装 "
|
||||
"memcache。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:412
|
||||
msgid ""
|
||||
"**memfabric_hybrid**: "
|
||||
"<https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
msgstr "**memfabric_hybrid**: <https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
msgstr ""
|
||||
"**memfabric_hybrid**: "
|
||||
"<https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:414
|
||||
msgid ""
|
||||
"**memcache**: "
|
||||
"<https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
msgstr "**memcache**: <https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
msgstr ""
|
||||
"**memcache**: "
|
||||
"<https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:416
|
||||
msgid "Configuring the memcache Config File"
|
||||
@@ -509,7 +517,9 @@ msgid ""
|
||||
"You are advised to copy mmc-local.conf and mmc-meta.conf to your own path"
|
||||
" and modify them, and set the MMC_META_CONFIG_PATH environment variable "
|
||||
"to the path of your own mmc-meta.conf file."
|
||||
msgstr "建议您将 mmc-local.conf 和 mmc-meta.conf 复制到您自己的路径并进行修改,并将 MMC_META_CONFIG_PATH 环境变量设置为您自己的 mmc-meta.conf 文件的路径。"
|
||||
msgstr ""
|
||||
"建议您将 mmc-local.conf 和 mmc-meta.conf 复制到您自己的路径并进行修改,并将 "
|
||||
"MMC_META_CONFIG_PATH 环境变量设置为您自己的 mmc-meta.conf 文件的路径。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:436
|
||||
msgid "**mmc-meta.conf:**"
|
||||
@@ -574,7 +584,9 @@ msgid ""
|
||||
" ROCE available, recommended for A2), `device_sdma` (supported for A3 "
|
||||
"when HCCS available, recommended for A3). Currently does not support "
|
||||
"heterogeneous protocol setting."
|
||||
msgstr "`host_rdma` (默认), `device_rdma` (A2 和 A3 在设备 ROCE 可用时支持,推荐用于 A2), `device_sdma` (A3 在 HCCS 可用时支持,推荐用于 A3)。目前不支持异构协议设置。"
|
||||
msgstr ""
|
||||
"`host_rdma` (默认), `device_rdma` (A2 和 A3 在设备 ROCE 可用时支持,推荐用于 A2), "
|
||||
"`device_sdma` (A3 在 HCCS 可用时支持,推荐用于 A3)。目前不支持异构协议设置。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md
|
||||
msgid "`ock.mmc.local_service.dram.size`"
|
||||
@@ -607,7 +619,10 @@ msgid ""
|
||||
"Using `MultiConnector` to simultaneously utilize both "
|
||||
"`MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` "
|
||||
"performs kv_transfer, while `AscendStoreConnector` enables KV Cache Pool"
|
||||
msgstr "使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 `AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 `AscendStoreConnector` 启用 KV 缓存池"
|
||||
msgstr ""
|
||||
"使用 `MultiConnector` 同时利用 `MooncakeConnectorV1` 和 "
|
||||
"`AscendStoreConnector`。`MooncakeConnectorV1` 执行 kv_transfer,而 "
|
||||
"`AscendStoreConnector` 启用 KV 缓存池"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:609
|
||||
#: ../../source/user_guide/feature_guide/kv_pool.md:918
|
||||
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -29,35 +29,33 @@ msgstr "概述"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:5
|
||||
msgid ""
|
||||
"**Layer Shard Linear** is a memory-optimization feature designed for "
|
||||
"**Layer Sharding Linear** is a memory-optimization feature designed for "
|
||||
"large language model (LLM) inference. It addresses the high memory "
|
||||
"pressure caused by **repeated linear operators across many layers** that "
|
||||
"share identical structure but have distinct weights."
|
||||
msgstr ""
|
||||
"**层分片线性算子** 是一项为大语言模型推理设计的内存优化功能。它旨在解决由**跨越多层的重复线性算子**所引起的高内存压力,这些算子结构相同但权重不同。"
|
||||
"**层分片线性算子** "
|
||||
"是一项为大语言模型推理设计的内存优化功能。它旨在解决由**跨越多层的重复线性算子**所引起的高内存压力,这些算子结构相同但权重不同。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:7
|
||||
msgid ""
|
||||
"Instead of replicating all weights on every device, **Layer Shard Linear "
|
||||
"shards the weights of a \"series\" of such operators across the NPU "
|
||||
"devices in a communication group**:"
|
||||
msgstr ""
|
||||
"与在每个设备上复制所有权重不同,**层分片线性算子将此类算子的一个\"系列\"的权重分片到通信组内的NPU设备上**:"
|
||||
"Instead of replicating all weights on every device, **Layer Sharding "
|
||||
"Linear shards the weights of a \"series\" of such operators across the "
|
||||
"NPU devices in a communication group**:"
|
||||
msgstr "与在每个设备上复制所有权重不同,**层分片线性算子将此类算子的一个\"系列\"的权重分片到通信组内的NPU设备上**:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:9
|
||||
msgid ""
|
||||
"The **i-th layer's linear weight** is stored **only on device `i % K`**, "
|
||||
"The **i-th layer's linear weight** is stored **only on device `i % K`** "
|
||||
"where `K` is the number of devices in the group."
|
||||
msgstr ""
|
||||
"**第 i 层的线性权重** **仅存储在设备 `i % K` 上**,其中 `K` 是组内的设备数量。"
|
||||
msgstr "**第 i 层的线性权重** **仅存储在设备 `i % K` 上**,其中 `K` 是组内的设备数量。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:10
|
||||
msgid ""
|
||||
"Other devices hold a lightweight **shared dummy tensor** during "
|
||||
"initialization and fetch the real weight **on-demand** via asynchronous "
|
||||
"broadcast during the forward pass."
|
||||
msgstr ""
|
||||
"其他设备在初始化期间持有一个轻量级的**共享虚拟张量**,并在前向传播期间通过异步广播**按需**获取真实权重。"
|
||||
msgstr "其他设备在初始化期间持有一个轻量级的**共享虚拟张量**,并在前向传播期间通过异步广播**按需**获取真实权重。"
|
||||
|
||||
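The placement rule in the two entries above (layer `i` lives on device `i % K`) can be sketched in a few lines. This only models which device owns which layer; it does not model the dummy tensors or the asynchronous broadcast. The function names and the choice of 8 devices are assumptions for illustration, while the 61-layer figure is the DeepSeek-V3/R1 depth mentioned elsewhere in the guide.

```python
from typing import Dict, List


def owner_device(layer_index: int, num_devices: int) -> int:
    """Device that stores the real weight of the given layer."""
    return layer_index % num_devices


def layers_by_device(num_layers: int, num_devices: int) -> Dict[int, List[int]]:
    """Group layer indices by the device that owns their weight."""
    placement: Dict[int, List[int]] = {d: [] for d in range(num_devices)}
    for layer in range(num_layers):
        placement[owner_device(layer, num_devices)].append(layer)
    return placement


# 61 layers over a hypothetical group of 8 devices.
placement = layers_by_device(num_layers=61, num_devices=8)
print({device: len(layers) for device, layers in placement.items()})
```

Each device ends up holding roughly `num_layers / K` real weights instead of all of them, which is where the memory saving comes from.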
#: ../../source/user_guide/feature_guide/layer_sharding.md:12
|
||||
msgid ""
|
||||
@@ -75,8 +73,7 @@ msgstr ""
|
||||
msgid ""
|
||||
"This approach **preserves exact computational semantics** while "
|
||||
"**significantly reducing NPU memory footprint**, especially critical for:"
|
||||
msgstr ""
|
||||
"这种方法**保持了精确的计算语义**,同时**显著减少了NPU内存占用**,这对于以下情况尤其关键:"
|
||||
msgstr "这种方法**保持了精确的计算语义**,同时**显著减少了NPU内存占用**,这对于以下情况尤其关键:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:16
|
||||
msgid "Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);"
|
||||
@@ -89,7 +86,9 @@ msgid ""
|
||||
"/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix "
|
||||
"must reside in memory per layer;"
|
||||
msgstr ""
|
||||
"使用 **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** 或 **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)** 的模型,其中完整的`O`(输出)投影矩阵必须驻留在每层的内存中;"
|
||||
"使用 **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** 或 "
|
||||
"**[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)** "
|
||||
"的模型,其中完整的`O`(输出)投影矩阵必须驻留在每层的内存中;"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:18
|
||||
msgid ""
|
||||
@@ -111,12 +110,13 @@ msgstr "层分片"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:26
|
||||
msgid ""
|
||||
"**Figure.** Layer Shard Linear workflow: weights are sharded by layer "
|
||||
"**Figure.** Layer Sharding Linear workflow: weights are sharded by layer "
|
||||
"across devices (top), and during forward execution (bottom), asynchronous"
|
||||
" broadcast **pre-fetches** the next layer's weight while the current "
|
||||
"layer computes—enabling **zero-overhead** weight loading."
|
||||
"layer computes, enabling **zero-overhead** weight loading."
|
||||
msgstr ""
|
||||
"**图.** 层分片线性算子工作流程:权重按层分片到各设备(顶部),在前向执行期间(底部),异步广播**预取**下一层的权重,同时当前层进行计算——实现**零开销**的权重加载。"
|
||||
"**图.** "
|
||||
"层分片线性算子工作流程:权重按层分片到各设备(顶部),在前向执行期间(底部),异步广播**预取**下一层的权重,同时当前层进行计算——实现**零开销**的权重加载。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:30
|
||||
msgid "Getting Started"
|
||||
@@ -124,11 +124,12 @@ msgstr "快速开始"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:32
|
||||
msgid ""
|
||||
"To enable **Layer Shard Linear**, specify the target linear layers using "
|
||||
"the `--additional-config` argument when launching your inference job. For"
|
||||
" example, to shard the `o_proj` and `q_b_proj` layers, use:"
|
||||
"To enable **Layer Sharding Linear**, specify the target linear layers "
|
||||
"using the `--additional-config` argument when launching your inference "
|
||||
"job. For example, to shard the `o_proj` and `q_b_proj` layers, use:"
|
||||
msgstr ""
|
||||
"要启用**层分片线性算子**,请在启动推理作业时使用 `--additional-config` 参数指定目标线性层。例如,要对 `o_proj` 和 `q_b_proj` 层进行分片,请使用:"
|
||||
"要启用**层分片线性算子**,请在启动推理作业时使用 `--additional-config` 参数指定目标线性层。例如,要对 `o_proj`"
|
||||
" 和 `q_b_proj` 层进行分片,请使用:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:40
|
||||
msgid ""
|
||||
@@ -136,7 +137,8 @@ msgid ""
|
||||
"be enabled on the **P node** with `kv_role=\"kv_producer\"`. "
|
||||
"`kv_role=\"kv_consumer\"` and `kv_role=\"kv_both\"` are not supported."
|
||||
msgstr ""
|
||||
"**限制** 在PD解耦部署中,层分片只能在 `kv_role=\"kv_producer\"` 的 **P节点** 上启用。不支持 `kv_role=\"kv_consumer\"` 和 `kv_role=\"kv_both\"`。"
|
||||
"**限制** 在PD解耦部署中,层分片只能在 `kv_role=\"kv_producer\"` 的 **P节点** 上启用。不支持 "
|
||||
"`kv_role=\"kv_consumer\"` 和 `kv_role=\"kv_both\"`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:46
|
||||
msgid "Supported Scenarios"
|
||||
@@ -157,7 +159,8 @@ msgid ""
|
||||
"resident in memory for each layer. Layer sharding significantly reduces "
|
||||
"memory pressure by distributing these weights across devices."
|
||||
msgstr ""
|
||||
"当使用 [FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188) 时,完整的输出投影(`o_proj`)矩阵必须驻留在每层的内存中。层分片通过将这些权重分布到各设备上,显著降低了内存压力。"
|
||||
"当使用 [FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188) "
|
||||
"时,完整的输出投影(`o_proj`)矩阵必须驻留在每层的内存中。层分片通过将这些权重分布到各设备上,显著降低了内存压力。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:54
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:71
|
||||
@@ -175,11 +178,12 @@ msgid ""
|
||||
"stored per layer. Sharding these layers across NPUs helps fit extremely "
|
||||
"deep models (e.g., 61-layer architectures) into limited device memory."
|
||||
msgstr ""
|
||||
"使用 [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702) 时,`q_b_proj` 和 `o_proj` 层都需要每层存储大型权重矩阵。将这些层分片到多个NPU上有助于将极深的模型(例如,61层架构)装入有限的设备内存中。"
|
||||
"使用 [DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702) "
|
||||
"时,`q_b_proj` 和 `o_proj` "
|
||||
"层都需要每层存储大型权重矩阵。将这些层分片到多个NPU上有助于将极深的模型(例如,61层架构)装入有限的设备内存中。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/layer_sharding.md:69
|
||||
msgid ""
|
||||
"In PD-disaggregated deployments, this mode is supported only on the **P "
|
||||
"node** with `kv_role=\"kv_producer\"`."
|
||||
msgstr ""
|
||||
"在PD解耦部署中,此模式仅在 `kv_role=\"kv_producer\"` 的 **P节点** 上受支持。"
|
||||
msgstr "在PD解耦部署中,此模式仅在 `kv_role=\"kv_producer\"` 的 **P节点** 上受支持。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -21,13 +21,12 @@ msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:1
|
||||
msgid "Netloader Guide"
|
||||
msgstr "网络加载器指南"
|
||||
msgstr "Netloader 指南"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:3
|
||||
msgid ""
|
||||
"This guide provides instructions for using **Netloader** as a weight-"
|
||||
"loader plugin for acceleration in **vLLM Ascend**."
|
||||
msgstr "本指南介绍如何将 **Netloader** 用作权重加载器插件,以在 **vLLM Ascend** 中实现加速。"
|
||||
"This guide provides instructions for using **Netloader** as a weight-loader plugin for acceleration in **vLLM Ascend**."
|
||||
msgstr "本指南介绍如何使用 **Netloader** 作为权重加载器插件,以在 **vLLM Ascend** 中实现加速。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:7
|
||||
msgid "Overview"
|
||||
@@ -35,9 +34,7 @@ msgstr "概述"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:9
|
||||
msgid ""
|
||||
"Netloader leverages high-bandwidth peer-to-peer (P2P) transfers between "
|
||||
"NPU cards to load model weights. It is implemented as a plugin (via the "
|
||||
"`register_model_loader` API added in vLLM 0.10). The workflow is:"
|
||||
"Netloader leverages high-bandwidth peer-to-peer (P2P) transfers between NPU cards to load model weights. It is implemented as a plugin (via the `register_model_loader` API added in vLLM 0.10). The workflow is:"
|
||||
msgstr "Netloader 利用 NPU 卡之间的高带宽点对点 (P2P) 传输来加载模型权重。它通过插件实现(使用 vLLM 0.10 中添加的 `register_model_loader` API)。工作流程如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:11
|
||||
@@ -50,16 +47,12 @@ msgstr "新的 **客户端** 实例请求权重传输。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:13
|
||||
msgid ""
|
||||
"After validating that the model and partitioning match, the client uses "
|
||||
"HCCL collective communication (send/recv) to receive weights in the same "
|
||||
"order as stored in the model."
|
||||
"After validating that the model and partitioning match, the client uses HCCL collective communication (send/recv) to receive weights in the same order as stored in the model."
|
||||
msgstr "在验证模型和分区匹配后,客户端使用 HCCL 集合通信 (send/recv) 按照模型中存储的相同顺序接收权重。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:15
|
||||
msgid ""
|
||||
"The server runs alongside normal inference tasks via sub-threads and via "
|
||||
"`stateless_init_torch_distributed_process_group` in vLLM. The client thus"
|
||||
" takes over weight initialization without needing to load from storage."
|
||||
"The server runs alongside normal inference tasks via sub-threads and via `stateless_init_torch_distributed_process_group` in vLLM. The client thus takes over weight initialization without needing to load from storage."
|
||||
msgstr "服务器通过子线程以及 vLLM 中的 `stateless_init_torch_distributed_process_group` 与常规推理任务并行运行。因此,客户端接管权重初始化,无需从存储加载。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:17
|
||||
@@ -68,11 +61,11 @@ msgstr "流程图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:19
|
||||
msgid ""
|
||||
msgstr ""
|
||||
msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:19
|
||||
msgid "netloader flowchart"
|
||||
msgstr "网络加载器流程图"
|
||||
msgstr "netloader 流程图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:21
|
||||
msgid "Timing Diagram"
|
||||
@@ -80,11 +73,11 @@ msgstr "时序图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:23
|
||||
msgid ""
|
||||
msgstr ""
|
||||
msgstr ""
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:23
|
||||
msgid "netloader timing diagram"
|
||||
msgstr "网络加载器时序图"
|
||||
msgstr "netloader 时序图"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:25
|
||||
msgid "Application Scenarios"
|
||||
@@ -92,30 +85,22 @@ msgstr "应用场景"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:27
|
||||
msgid ""
|
||||
"**Reduce startup latency**: By reusing already loaded weights and "
|
||||
"transferring them directly between NPU cards, Netloader cuts down model "
|
||||
"loading time versus conventional remote/local pull strategies."
|
||||
"**Reduce startup latency**: By reusing already loaded weights and transferring them directly between NPU cards, Netloader cuts down model loading time versus conventional remote/local pull strategies."
|
||||
msgstr "**减少启动延迟**:通过重用已加载的权重并在 NPU 卡之间直接传输,Netloader 相比传统的远程/本地拉取策略,缩短了模型加载时间。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:28
|
||||
msgid ""
|
||||
"**Relieve network & storage load**: Avoid repeated downloads of weight "
|
||||
"files from remote repositories, thus reducing pressure on central storage"
|
||||
" and network traffic."
|
||||
"**Relieve network & storage load**: Avoid repeated downloads of weight files from remote repositories, thus reducing pressure on central storage and network traffic."
|
||||
msgstr "**减轻网络和存储负载**:避免从远程仓库重复下载权重文件,从而减轻中心存储和网络流量的压力。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:29
|
||||
msgid ""
|
||||
"**Improve resource utilization & lower cost**: Faster loading allows less"
|
||||
" reliance on standby compute nodes; resources can be scaled up/down more "
|
||||
"flexibly."
|
||||
"**Improve resource utilization & lower cost**: Faster loading allows less reliance on standby compute nodes; resources can be scaled up/down more flexibly."
|
||||
msgstr "**提高资源利用率并降低成本**:更快的加载速度减少了对备用计算节点的依赖;资源可以更灵活地伸缩。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:30
|
||||
msgid ""
|
||||
"**Enhance business continuity & high availability**: In failure recovery,"
|
||||
" new instances can quickly take over without long downtime, improving "
|
||||
"system reliability and user experience."
|
||||
"**Enhance business continuity & high availability**: In failure recovery, new instances can quickly take over without long downtime, improving system reliability and user experience."
|
||||
msgstr "**增强业务连续性和高可用性**:在故障恢复时,新实例可以快速接管而无需长时间停机,从而提高系统可靠性和用户体验。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:34
|
||||
@@ -124,9 +109,7 @@ msgstr "使用方法"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:36
|
||||
msgid ""
|
||||
"To enable Netloader, pass `--load-format=netloader` and provide "
|
||||
"configuration via `--model-loader-extra-config` (as a JSON string). Below"
|
||||
" are the supported configuration fields:"
|
||||
"To enable Netloader, pass `--load-format=netloader` and provide configuration via `--model-loader-extra-config` (as a JSON string). Below are the supported configuration fields:"
|
||||
msgstr "要启用 Netloader,请传递 `--load-format=netloader` 并通过 `--model-loader-extra-config`(作为 JSON 字符串)提供配置。以下是支持的配置字段:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -156,12 +139,7 @@ msgstr "列表"
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"Weight data sources. Each item is a map with `device_id` and `sources`, "
|
||||
"specifying the rank and its endpoints (IP:port). <br>Example: "
|
||||
"`{\"SOURCE\": [{\"device_id\": 0, \"sources\": "
|
||||
"[\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": "
|
||||
"[\"10.170.22.152:11228\"]}]}` <br>If omitted or empty, fallback to "
|
||||
"default loader. The SOURCE here is second priority."
|
||||
"Weight data sources. Each item is a map with `device_id` and `sources`, specifying the rank and its endpoints (IP:port). <br>Example: `{\"SOURCE\": [{\"device_id\": 0, \"sources\": [\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": [\"10.170.22.152:11228\"]}]}` <br>If omitted or empty, fallback to default loader. The SOURCE here is second priority."
|
||||
msgstr "权重数据源。每个条目是一个包含 `device_id` 和 `sources` 的映射,指定了 rank 及其端点 (IP:端口)。<br>示例:`{\"SOURCE\": [{\"device_id\": 0, \"sources\": [\"10.170.22.152:19374\"]}, {\"device_id\": 1, \"sources\": [\"10.170.22.152:11228\"]}]}` <br>如果省略或为空,则回退到默认加载器。此处的 SOURCE 是第二优先级。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -198,9 +176,7 @@ msgstr "服务器监听器的基础端口。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port "
|
||||
"is chosen. Valid range: 1024–65535. If out of range, that server instance"
|
||||
" won’t open a listener."
|
||||
"The actual port = `LISTEN_PORT + RANK`. If omitted, a random valid port is chosen. Valid range: 1024–65535. If out of range, that server instance won’t open a listener."
|
||||
msgstr "实际端口 = `LISTEN_PORT + RANK`。如果省略,则选择一个随机有效端口。有效范围:1024–65535。如果超出范围,该服务器实例将不会打开监听器。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -213,10 +189,7 @@ msgstr "处理量化模型中 int8 参数的行为。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"One of `[\"hbm\", \"dram\", \"no\"]`. <br> - `hbm`: copy original int8 "
|
||||
"parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> -"
|
||||
" `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to "
|
||||
"divergence or unpredictable behavior). Default: `\"no\"`."
|
||||
"One of `[\"hbm\", \"dram\", \"no\"]`. <br> - `hbm`: copy original int8 parameters to high-bandwidth memory (HBM) (may cost a lot of HBM). <br> - `dram`: copy to DRAM. <br> - `no`: no special handling (may lead to divergence or unpredictable behavior). Default: `\"no\"`."
|
||||
msgstr "取值为 `[\"hbm\", \"dram\", \"no\"]` 之一。<br> - `hbm`:将原始 int8 参数复制到高带宽内存 (HBM)(可能消耗大量 HBM)。<br> - `dram`:复制到 DRAM。<br> - `no`:不进行特殊处理(可能导致分歧或不可预测的行为)。默认值:`\"no\"`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -242,8 +215,7 @@ msgstr "在服务器模式下,用于写入每个 rank 监听器地址/端口
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"If set, each rank writes to `{OUTPUT_PREFIX}{RANK}.txt` (text), content ="
|
||||
" `IP:Port`."
|
||||
"If set, each rank writes to `{OUTPUT_PREFIX}{RANK}.txt` (text), content = `IP:Port`."
|
||||
msgstr "如果设置,每个 rank 将写入 `{OUTPUT_PREFIX}{RANK}.txt`(文本文件),内容为 `IP:Port`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
@@ -256,8 +228,7 @@ msgstr "指定上述配置的 JSON 文件路径。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md
|
||||
msgid ""
|
||||
"If provided, the SOURCE inside this file has **first priority** "
|
||||
"(overrides SOURCE in other configs)."
|
||||
"If provided, the SOURCE inside this file has **first priority** (overrides SOURCE in other configs)."
|
||||
msgstr "如果提供,此文件内的 SOURCE 具有 **最高优先级**(覆盖其他配置中的 SOURCE)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:50
|
||||
@@ -294,14 +265,12 @@ msgstr "`<port>`:服务器上的基础监听端口"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:85
|
||||
msgid ""
|
||||
"`<server_IP>` + `<server_Port>`: IP and port of the Netloader server "
|
||||
"(from server log)"
|
||||
"`<server_IP>` + `<server_Port>`: IP and port of the Netloader server (from server log)"
|
||||
msgstr "`<server_IP>` + `<server_Port>`:Netloader 服务器的 IP 和端口(来自服务器日志)"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:86
|
||||
msgid ""
|
||||
"`<device_id_diff_from_server>`: Client device ID (must differ from "
|
||||
"server’s)"
|
||||
"`<device_id_diff_from_server>`: Client device ID (must differ from server’s)"
|
||||
msgstr "`<device_id_diff_from_server>`:客户端设备 ID(必须与服务器的不同)"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:87
|
||||
@@ -310,8 +279,7 @@ msgstr "`<client_port>`:客户端监听的端口"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:89
|
||||
msgid ""
|
||||
"After startup, you can test consistency by issuing inference requests "
|
||||
"with temperature = 0 and comparing outputs."
|
||||
"After startup, you can test consistency by issuing inference requests with temperature = 0 and comparing outputs."
|
||||
msgstr "启动后,您可以通过发送 temperature = 0 的推理请求并比较输出来测试一致性。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:93
|
||||
@@ -320,22 +288,15 @@ msgstr "注意事项与限制"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:95
|
||||
msgid ""
|
||||
"If Netloader is used, **each worker process** must bind a listening port."
|
||||
" That port may be user-specified or assigned randomly. If user-specified,"
|
||||
" ensure it is available."
|
||||
"If Netloader is used, **each worker process** must bind a listening port. That port may be user-specified or assigned randomly. If user-specified, ensure it is available."
|
||||
msgstr "如果使用 Netloader,**每个工作进程** 都必须绑定一个监听端口。该端口可以是用户指定的,也可以是随机分配的。如果是用户指定的,请确保其可用。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:96
|
||||
msgid ""
|
||||
"Netloader requires extra HBM memory to establish HCCL connections (i.e. "
|
||||
"`HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient "
|
||||
"capacity (e.g. via `--gpu-memory-utilization`)."
|
||||
msgstr "Netloader 需要额外的 HBM 内存来建立 HCCL 连接(即 `HCCL_BUFFERSIZE`,默认约 200 MB)。用户应预留足够的容量(例如通过 `--gpu-memory-utilization`)。"
|
||||
"Netloader requires extra on-chip memory to establish HCCL connections (i.e. `HCCL_BUFFERSIZE`, default ~200 MB). Users should reserve sufficient capacity (e.g. via `--gpu-memory-utilization`)."
|
||||
msgstr "Netloader 需要额外的片上内存来建立 HCCL 连接(即 `HCCL_BUFFERSIZE`,默认约 200 MB)。用户应预留足够的容量(例如通过 `--gpu-memory-utilization`)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/netloader.md:97
|
||||
msgid ""
|
||||
"It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or"
|
||||
" slow connections/transmissions. Related info: [vLLM Issue "
|
||||
"#16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR "
|
||||
"#16226](https://github.com/vllm-project/vllm/pull/16226)."
|
||||
"It is recommended to set `VLLM_SLEEP_WHEN_IDLE=1` to mitigate unstable or slow connections/transmissions. Related info: [vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226)."
|
||||
msgstr "建议设置 `VLLM_SLEEP_WHEN_IDLE=1` 以缓解不稳定或缓慢的连接/传输。相关信息:[vLLM Issue #16660](https://github.com/vllm-project/vllm/issues/16660), [vLLM PR #16226](https://github.com/vllm-project/vllm/pull/16226)。"
|
||||
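Reviewer note on the Netloader strings above: the port rule (`LISTEN_PORT + RANK`, valid range 1024–65535) and the `SOURCE` layout can be checked with a small sketch. The helper names `listener_port` and `client_extra_config` are hypothetical and not part of vLLM Ascend. It is also an assumption here that the range check applies to the computed per-rank port rather than to the base port.

```python
import json
import shlex

PORT_MIN, PORT_MAX = 1024, 65535  # valid range quoted in the guide


def listener_port(listen_port, rank):
    """Return LISTEN_PORT + RANK, or None when no listener would be opened.

    Assumption: the range check applies to the computed per-rank port.
    """
    port = listen_port + rank
    return port if PORT_MIN <= port <= PORT_MAX else None


def client_extra_config(endpoints):
    """Build the JSON string passed to --model-loader-extra-config.

    `endpoints` maps a device_id to its list of "IP:port" strings,
    mirroring the SOURCE example in the translated table.
    """
    source = [
        {"device_id": dev, "sources": list(srcs)}
        for dev, srcs in sorted(endpoints.items())
    ]
    return json.dumps({"SOURCE": source})


if __name__ == "__main__":
    cfg = client_extra_config(
        {0: ["10.170.22.152:19374"], 1: ["10.170.22.152:11228"]}
    )
    # shlex.quote keeps the JSON intact when pasted into a shell command.
    print("--load-format=netloader --model-loader-extra-config", shlex.quote(cfg))
    print([listener_port(19374, rank) for rank in range(2)])
```

Only the `SOURCE` field is built here because its shape is spelled out in the strings above; other fields would be added to the same dictionary.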
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -33,11 +33,12 @@ msgid ""
|
||||
"ascend/issues/4715), this is a simple ACLGraph graph mode acceleration "
|
||||
"solution based on Fx graphs."
|
||||
msgstr ""
|
||||
"如 [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715) 中所述,这是一个基于 Fx 图的简单 ACLGraph 图模式加速解决方案。"
|
||||
"如 [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715) "
|
||||
"中所述,这是一个基于 Fx 图的简单 ACLGraph 图模式加速解决方案。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/npugraph_ex.md:7
|
||||
msgid "Using npugraph_ex"
|
||||
msgstr "使用 npugraph_ex"
|
||||
msgid "Using Npugraph_ex"
|
||||
msgstr "使用 Npugraph_ex"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/npugraph_ex.md:9
|
||||
msgid ""
|
||||
@@ -58,4 +59,6 @@ msgid ""
|
||||
"You can find more details about "
|
||||
"[npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)"
|
||||
msgstr ""
|
||||
"您可以在 [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html) 找到更多详细信息。"
|
||||
"您可以在 "
|
||||
"[npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)"
|
||||
" 找到更多详细信息。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend\n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -35,8 +35,7 @@ msgid ""
|
||||
"such as PPO, GRPO, or DPO. During training, the policy model typically "
|
||||
"performs autoregressive generation using inference engines like vLLM, "
|
||||
"followed by forward and backward passes for optimization."
|
||||
msgstr ""
|
||||
"睡眠模式是一个专为从NPU内存中卸载模型权重并丢弃KV缓存而设计的API。此功能对于强化学习(RL)后训练工作负载至关重要,特别是在PPO、GRPO或DPO等在线算法中。在训练期间,策略模型通常使用vLLM等推理引擎执行自回归生成,随后进行前向和反向传播以完成优化。"
|
||||
msgstr "睡眠模式是一个专为从NPU内存中卸载模型权重并丢弃KV缓存而设计的API。此功能对于强化学习(RL)后训练工作负载至关重要,特别是在PPO、GRPO或DPO等在线算法中。在训练期间,策略模型通常使用vLLM等推理引擎执行自回归生成,随后进行前向和反向传播以完成优化。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:7
|
||||
msgid ""
|
||||
@@ -44,8 +43,7 @@ msgid ""
|
||||
"parallelism strategies, it becomes crucial to free KV cache and even "
|
||||
"offload model parameters stored within vLLM during training. This ensures"
|
||||
" efficient memory utilization and avoids resource contention on the NPU."
|
||||
msgstr ""
|
||||
"由于生成阶段和训练阶段可能采用不同的模型并行策略,因此在训练期间释放KV缓存,甚至卸载存储在vLLM中的模型参数变得至关重要。这确保了高效的内存利用,并避免了NPU上的资源争用。"
|
||||
msgstr "由于生成阶段和训练阶段可能采用不同的模型并行策略,因此在训练期间释放KV缓存,甚至卸载存储在vLLM中的模型参数变得至关重要。这确保了高效的内存利用,并避免了NPU上的资源争用。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:9
|
||||
msgid "Getting started"
|
||||
@@ -55,11 +53,13 @@ msgstr "快速入门"
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in"
|
||||
" vllm is under a specific memory pool. During model loading and KV cache "
|
||||
" vLLM is under a specific memory pool. During model loading and KV cache "
|
||||
"initialization, we tag the memory as a map: `{\"weight\": data, "
|
||||
"\"kv_cache\": data}`."
|
||||
msgstr ""
|
||||
"当设置 `enable_sleep_mode=True` 时,我们在vllm中管理内存(分配、释放)的方式将在一个特定的内存池下进行。在模型加载和KV缓存初始化期间,我们将内存标记为一个映射:`{\"weight\": data, \"kv_cache\": data}`。"
|
||||
"当设置 `enable_sleep_mode=True` "
|
||||
"时,我们在vLLM中管理内存(分配、释放)的方式将在一个特定的内存池下进行。在模型加载和KV缓存初始化期间,我们将内存标记为一个映射:`{\"weight\":"
|
||||
" data, \"kv_cache\": data}`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:13
|
||||
msgid ""
|
||||
@@ -115,7 +115,8 @@ msgid ""
|
||||
"`export COMPILE_CUSTOM_KERNELS=1`."
|
||||
msgstr ""
|
||||
"由于此功能使用了底层API "
|
||||
"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html),为了使用睡眠模式,您应遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)并从源码构建。如果您使用的版本低于v0.12.0rc1,请记得设置 `export COMPILE_CUSTOM_KERNELS=1`。"
|
||||
"[AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html),为了使用睡眠模式,您应遵循[安装指南](https://docs.vllm.ai/projects/ascend/en/latest/installation.html)并从源码构建。如果您使用的版本低于v0.12.0rc1,请记得设置"
|
||||
" `export COMPILE_CUSTOM_KERNELS=1`。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/sleep_mode.md:28
|
||||
msgid "Usage"
|
||||
@@ -139,4 +140,5 @@ msgid ""
|
||||
" are under a dev-mode, and explicitly specify the dev environment "
|
||||
"`VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up)."
|
||||
msgstr ""
|
||||
"考虑到可能存在恶意访问的风险,请确保您处于开发模式,并明确指定开发环境变量 `VLLM_SERVER_DEV_MODE` 以开放这些端点(sleep/wake up)。"
|
||||
"考虑到可能存在恶意访问的风险,请确保您处于开发模式,并明确指定开发环境变量 `VLLM_SERVER_DEV_MODE` "
|
||||
"以开放这些端点(sleep/wake up)。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -32,16 +32,16 @@ msgid ""
|
||||
"Unified Cache Management (UCM) provides an external KV-cache storage "
|
||||
"layer designed for prefix-caching scenarios in vLLM/vLLM-Ascend. Unlike "
|
||||
"KV Pooling, which expands prefix-cache capacity only by aggregating "
|
||||
"device memory and therefore remains limited by HBM/DRAM size and lacks "
|
||||
"persistence, UCM decouples compute from storage and adopts a tiered "
|
||||
"design. Each node uses local DRAM as a fast cache, while a shared "
|
||||
"device memory and therefore remains limited by on-chip memory/DRAM size "
|
||||
"and lacks persistence, UCM decouples compute from storage and adopts a "
|
||||
"tiered design. Each node uses local DRAM as a fast cache, while a shared "
|
||||
"backend—such as 3FS or enterprise-grade storage—serves as the persistent "
|
||||
"KV store. This approach removes the capacity ceiling imposed by device "
|
||||
"memory, enables durable and reliable prefix caching, and allows cache "
|
||||
"capacity to scale with the storage system rather than with compute "
|
||||
"resources."
|
||||
msgstr ""
|
||||
"统一缓存管理(UCM)为vLLM/vLLM-Ascend中的前缀缓存场景提供了一个外部的KV缓存存储层。与仅通过聚合设备内存来扩展前缀缓存容量、因此仍受限于HBM/DRAM大小且缺乏持久性的KV池化不同,UCM将计算与存储解耦,并采用分层设计。每个节点使用本地DRAM作为快速缓存,而共享后端(如3FS或企业级存储)则作为持久化的KV存储。这种方法消除了设备内存带来的容量上限,实现了持久可靠的前缀缓存,并使缓存容量能够随存储系统而非计算资源扩展。"
|
||||
"统一缓存管理(UCM)为vLLM/vLLM-Ascend中的前缀缓存场景提供了外部KV缓存存储层。与仅通过聚合设备内存扩展前缀缓存容量、因此仍受限于片上内存/DRAM大小且缺乏持久性的KV池化不同,UCM将计算与存储解耦,并采用分层设计。每个节点使用本地DRAM作为快速缓存,而共享后端(如3FS或企业级存储)则作为持久化KV存储。这种方法消除了设备内存带来的容量上限,实现了持久可靠的前缀缓存,并使缓存容量能够随存储系统而非计算资源扩展。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:7
|
||||
msgid "Prerequisites"
|
||||
@@ -73,7 +73,8 @@ msgid ""
|
||||
"NPU](https://ucm.readthedocs.io/en/latest/getting-"
|
||||
"started/quickstart_vllm_ascend.html)**"
|
||||
msgstr ""
|
||||
"**请参考[昇腾NPU的官方UCM安装指南](https://ucm.readthedocs.io/en/latest/getting-started/quickstart_vllm_ascend.html)**"
|
||||
"**请参考[昇腾NPU的官方UCM安装指南](https://ucm.readthedocs.io/en/latest/getting-"
|
||||
"started/quickstart_vllm_ascend.html)**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:18
|
||||
msgid "Configure UCM for Prefix Caching"
|
||||
@@ -96,11 +97,12 @@ msgid ""
|
||||
"documentation for prefix-caching](https://ucm.readthedocs.io/en/latest"
|
||||
"/user-guide/prefix-cache/nfs_store.html)**"
|
||||
msgstr ""
|
||||
"**有关最新的配置选项,请参考[前缀缓存的官方UCM文档](https://ucm.readthedocs.io/en/latest/user-guide/prefix-cache/nfs_store.html)**"
|
||||
"**有关最新的配置选项,请参考[前缀缓存的官方UCM文档](https://ucm.readthedocs.io/en/latest/user-"
|
||||
"guide/prefix-cache/nfs_store.html)**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:27
|
||||
msgid "A minimal configuration looks like this:"
|
||||
msgstr "一个最小配置示例如下:"
|
||||
msgstr "最小配置示例如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:39
|
||||
msgid "Explanation:"
|
||||
@@ -119,7 +121,8 @@ msgid ""
|
||||
" here. **⚠️ Make sure to replace `\"/mnt/test\"` with your actual "
|
||||
"storage directory.**"
|
||||
msgstr ""
|
||||
"storage_backends:指定用于存储KV块的目录。它可以是本地目录或NFS挂载路径。UCM将在此处存储KV块。**⚠️ 请确保将`\"/mnt/test\"`替换为您的实际存储目录。**"
|
||||
"storage_backends:指定用于存储KV块的目录。可以是本地目录或NFS挂载路径。UCM将在此处存储KV块。**⚠️ "
|
||||
"请确保将`\"/mnt/test\"`替换为您的实际存储目录。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:48
|
||||
msgid "use_direct: Whether to enable direct I/O (optional). Default is `false`."
|
||||
@@ -132,7 +135,8 @@ msgid ""
|
||||
"on Ascend, so it must be set to `false` (all ranks load/dump "
|
||||
"independently)."
|
||||
msgstr ""
|
||||
"load_only_first_rank:控制是否仅rank 0加载KV缓存并将其广播到其他rank。此功能目前在昇腾上不受支持,因此必须设置为`false`(所有rank独立加载/转储)。"
|
||||
"load_only_first_rank:控制是否仅rank "
|
||||
"0加载KV缓存并将其广播到其他rank。此功能目前在昇腾上不受支持,因此必须设置为`false`(所有rank独立加载/转储)。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:55
|
||||
msgid "Launching Inference"
|
||||
@@ -143,7 +147,7 @@ msgid ""
|
||||
"In this guide, we describe **online inference** using vLLM with the UCM "
|
||||
"connector, deployed as an OpenAI-compatible server. For best performance "
|
||||
"with UCM, it is recommended to set `block_size` to 128."
|
||||
msgstr "在本指南中,我们描述使用带有UCM连接器的vLLM进行**在线推理**,部署为OpenAI兼容的服务器。为了获得UCM的最佳性能,建议将`block_size`设置为128。"
|
||||
msgstr "在本指南中,我们描述使用带有UCM连接器的vLLM进行**在线推理**,部署为OpenAI兼容的服务器。为获得UCM的最佳性能,建议将`block_size`设置为128。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:59
|
||||
msgid "To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:"
|
||||
@@ -154,7 +158,9 @@ msgid ""
|
||||
"**⚠️ Make sure to replace `\"/vllm-workspace/unified-cache-"
|
||||
"management/examples/ucm_config_example.yaml\"` with your actual config "
|
||||
"file path.**"
|
||||
msgstr "**⚠️ 请确保将`\"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml\"`替换为您的实际配置文件路径。**"
|
||||
msgstr ""
|
||||
"**⚠️ 请确保将`\"/vllm-workspace/unified-cache-"
|
||||
"management/examples/ucm_config_example.yaml\"`替换为您的实际配置文件路径。**"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:81
|
||||
msgid "If you see the log below:"
|
||||
@@ -176,7 +182,9 @@ msgid ""
|
||||
"way to observe the prefix caching effect is to run the built-in `vllm "
|
||||
"bench` CLI. Executing the following command **twice** in a separate "
|
||||
"terminal shows the improvement clearly."
|
||||
msgstr "在启用`UCMConnector`启动vLLM服务器后,观察前缀缓存效果的最简单方法是运行内置的`vllm bench` CLI。在单独的终端中**两次**执行以下命令可以清晰地展示改进效果。"
|
||||
msgstr ""
|
||||
"在启用`UCMConnector`启动vLLM服务器后,观察前缀缓存效果的最简单方法是运行内置的`vllm "
|
||||
"bench` CLI。在单独的终端中**两次**执行以下命令可以清晰地展示改进效果。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/ucm_deployment.md:112
|
||||
msgid "After the first execution"
|
||||
@@ -216,4 +224,5 @@ msgid ""
|
||||
"cached prefix significantly reduces the initial latency observed by the "
|
||||
"model, yielding an approximate **8× improvement in TTFT** compared to the"
|
||||
" initial run."
|
||||
msgstr "这表明在第二次请求期间,UCM成功从存储后端检索了全部125个缓存的KV块。利用完全缓存的前缀显著减少了模型观察到的初始延迟,与首次运行相比,TTFT实现了约**8倍的提升**。"
|
||||
msgstr ""
|
||||
"这表明在第二次请求期间,UCM成功从存储后端检索了全部125个缓存的KV块。利用完全缓存的前缀显著减少了模型观察到的初始延迟,与首次运行相比,TTFT实现了约**8倍的提升**。"
|
||||
@@ -8,7 +8,7 @@ msgid ""
|
||||
msgstr ""
|
||||
"Project-Id-Version: vllm-ascend \n"
|
||||
"Report-Msgid-Bugs-To: \n"
|
||||
"POT-Creation-Date: 2026-04-14 09:08+0000\n"
|
||||
"POT-Creation-Date: 2026-04-22 08:13+0000\n"
|
||||
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
|
||||
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"Language: zh_CN\n"
|
||||
@@ -35,29 +35,24 @@ msgid ""
|
||||
"L2 cache ahead of time, reducing MTE utilization during the linear layer "
|
||||
"computations and indirectly improving Cube computation efficiency by "
|
||||
"minimizing resource contention and optimizing data flow."
|
||||
msgstr ""
|
||||
"权重预取通过在需要之前将权重预加载到缓存中来优化内存使用,从而最小化模型执行期间因内存访问造成的延迟。线性层有时表现出相对较高的MTE利用率。为了解决这个问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与原始向量计算流水线(如量化、MoE门控top_k、RMSNorm和SwiGlu)并行运行。这种方法允许权重提前预加载到L2缓存中,减少线性层计算期间的MTE利用率,并通过最小化资源争用和优化数据流间接提高Cube计算效率。"
|
||||
msgstr "权重预取通过在需要之前将权重预加载到缓存中来优化内存使用,从而最小化模型执行期间因内存访问造成的延迟。线性层有时表现出相对较高的MTE利用率。为了解决这个问题,我们创建了一个专门用于权重预取的独立流水线,该流水线与原始向量计算流水线(如量化、MoE门控top_k、RMSNorm和SwiGlu)并行运行。这种方法允许权重提前预加载到L2缓存中,减少线性层计算期间的MTE利用率,并通过最小化资源争用和优化数据流间接提高Cube计算效率。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:5
|
||||
msgid ""
|
||||
"Since we use vector computations to hide the weight prefetching pipeline,"
|
||||
" this has an effect on computation. If you prioritize low latency over "
|
||||
"high throughput, it is best not to enable prefetching."
|
||||
msgstr ""
|
||||
"由于我们使用向量计算来隐藏权重预取流水线,这会对计算产生影响。如果您优先考虑低延迟而非高吞吐量,最好不要启用预取。"
|
||||
msgstr "由于我们使用向量计算来隐藏权重预取流水线,这会对计算产生影响。如果您优先考虑低延迟而非高吞吐量,最好不要启用预取。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:7
|
||||
msgid "Quick Start"
|
||||
msgstr "快速开始"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:9
|
||||
#, python-brace-format
|
||||
msgid ""
|
||||
"With `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` to open weight prefetch."
|
||||
msgstr ""
|
||||
"使用 `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` 来开启权重预取。"
|
||||
"Use `--additional-config '{\"weight_prefetch_config\": {\"enabled\": "
|
||||
"true}}'` to enable weight prefetch."
|
||||
msgstr "使用 `--additional-config '{\"weight_prefetch_config\": {\"enabled\": true}}'` 来开启权重预取。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:11
|
||||
msgid "Fine-tune Prefetch Ratio"
|
||||
@@ -72,25 +67,21 @@ msgid ""
|
||||
"performance degradation. To accommodate different scenarios, we have "
|
||||
"added `prefetch_ratio` to allow for flexible size configuration based on "
|
||||
"the specific workload, details as follows:"
|
||||
msgstr ""
|
||||
"由于权重预取使用向量计算来隐藏权重预取流水线,预取大小的设置至关重要。如果大小太小,则无法充分发挥优化优势;而较大的大小可能导致资源争用,从而导致性能下降。为了适应不同的场景,我们添加了`prefetch_ratio`,允许根据具体工作负载灵活配置大小,详情如下:"
|
||||
msgstr "由于权重预取使用向量计算来隐藏权重预取流水线,预取大小的设置至关重要。如果大小太小,则无法充分发挥优化优势;而较大的大小可能导致资源争用,从而导致性能下降。为了适应不同的场景,我们添加了`prefetch_ratio`,允许根据具体工作负载灵活配置大小,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:15
|
||||
msgid ""
|
||||
"With `prefetch_ratio` in `\"weight_prefetch_config\"` to custom the "
|
||||
"weight prefetch ratio for specific linear layers."
|
||||
msgstr ""
|
||||
"使用`\"weight_prefetch_config\"`中的`prefetch_ratio`来为特定的线性层自定义权重预取比例。"
|
||||
msgstr "使用`\"weight_prefetch_config\"`中的`prefetch_ratio`来为特定的线性层自定义权重预取比例。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:17
|
||||
msgid ""
|
||||
"The “attn” and “moe” configuration options are used for MoE model, "
|
||||
"details as follows:"
|
||||
msgstr ""
|
||||
"“attn”和“moe”配置选项用于MoE模型,详情如下:"
|
||||
msgstr "“attn”和“moe”配置选项用于MoE模型,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:19
|
||||
#, python-brace-format
|
||||
msgid "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
msgstr "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
|
||||
@@ -98,11 +89,9 @@ msgstr "`\"attn\": { \"qkv\": 1.0, \"o\": 1.0}, \"moe\": {\"gate_up\": 0.8}`"
|
||||
msgid ""
|
||||
"The “mlp” configuration option is used to optimize the performance of the"
|
||||
" Dense model, details as follows:"
|
||||
msgstr ""
|
||||
"“mlp”配置选项用于优化Dense模型的性能,详情如下:"
|
||||
msgstr "“mlp”配置选项用于优化Dense模型的性能,详情如下:"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:23
|
||||
#, python-brace-format
|
||||
msgid "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
|
||||
msgstr "`\"mlp\": {\"gate_up\": 1.0, \"down\": 1.0}`"
|
||||
|
||||
@@ -111,8 +100,7 @@ msgid ""
|
||||
"Above value are the default config, the default value has a good "
|
||||
"performance for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, for "
|
||||
"Qwen3-32B-W8A8 when `--max-num-seqs` is 72."
|
||||
msgstr ""
|
||||
"以上值为默认配置,当`--max-num-seqs`为144时,该默认值对Qwen3-235B-A22B-W8A8有良好性能;当`--max-num-seqs`为72时,对Qwen3-32B-W8A8有良好性能。"
|
||||
msgstr "以上值为默认配置,当`--max-num-seqs`为144时,该默认值对Qwen3-235B-A22B-W8A8有良好性能;当`--max-num-seqs`为72时,对Qwen3-32B-W8A8有良好性能。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:27
|
||||
msgid ""
|
||||
@@ -128,8 +116,7 @@ msgid ""
|
||||
"computation operator. In the profiling timeline, a prefetch operation "
|
||||
"appears as a CMO operation on a single stream; this CMO operation is the "
|
||||
"prefetch operation."
|
||||
msgstr ""
|
||||
"然而,这可能不是您场景下的最优配置。对于更高的并发度,可以尝试增加预取大小。对于较低的并发度,预取可能不会带来任何优势,因此可以减少大小或禁用预取。通过收集性能分析数据来确定预取大小是否合适。具体来说,检查预取操作(例如,MLP Down Proj权重预取)所需的时间是否与并行向量计算算子(例如,SwiGlu计算)所需的时间重叠,以及预取操作是否不晚于向量计算算子的完成时间。在性能分析时间线中,预取操作显示为单个流上的CMO操作;此CMO操作即为预取操作。"
|
||||
msgstr "然而,这可能不是您场景下的最优配置。对于更高的并发度,可以尝试增加预取大小。对于较低的并发度,预取可能不会带来任何优势,因此可以减少大小或禁用预取。通过收集性能分析数据来确定预取大小是否合适。具体来说,检查预取操作(例如,MLP Down Proj权重预取)所需的时间是否与并行向量计算算子(例如,SwiGlu计算)所需的时间重叠,以及预取操作是否不晚于向量计算算子的完成时间。在性能分析时间线中,预取操作显示为单个流上的CMO操作;此CMO操作即为预取操作。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:29
|
||||
msgid "Notes:"
|
||||
@@ -140,16 +127,14 @@ msgid ""
|
||||
"Weight prefetch of MLP `down` project prefetch depends on sequence "
|
||||
"parallel, if you want to open for mlp `down` please also enable sequence "
|
||||
"parallel."
|
||||
msgstr ""
|
||||
"MLP `down`投影的权重预取依赖于序列并行,如果您想为mlp `down`开启预取,请同时启用序列并行。"
|
||||
msgstr "MLP `down`投影的权重预取依赖于序列并行,如果您想为mlp `down`开启预取,请同时启用序列并行。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:32
|
||||
msgid ""
|
||||
"Due to the current size of the L2 cache, the maximum prefetch cannot "
|
||||
"exceed 18MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 *"
|
||||
" 1024` bytes, the backend will only prefetch 18MB."
|
||||
msgstr ""
|
||||
"由于当前L2缓存的大小,最大预取量不能超过18MB。如果`prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024`字节,后端将只预取18MB。"
|
||||
msgstr "由于当前L2缓存的大小,最大预取量不能超过18MB。如果`prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024`字节,后端将只预取18MB。"
|
||||
|
||||
#: ../../source/user_guide/feature_guide/weight_prefetch.md:34
|
||||
msgid "Example"
|
||||
@@ -167,5 +152,4 @@ msgstr "对于Dense模型:"
|
||||
msgid ""
|
||||
"Following is the default configuration that can get a good performance "
|
||||
"for `--max-num-seqs` is 72 for Qwen3-32B-W8A8"
|
||||
msgstr ""
|
||||
"以下是默认配置,当`--max-num-seqs`为72时,该配置可为Qwen3-32B-W8A8带来良好性能"
|
||||
msgstr "以下是默认配置,当`--max-num-seqs`为72时,该配置可为Qwen3-32B-W8A8带来良好性能"
|
||||
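Reviewer note on the weight prefetch strings above: the 18 MB cap and the `prefetch_ratio` options can be illustrated with a short sketch. The function `prefetch_bytes` is hypothetical. Nesting `attn` and `moe` under `prefetch_ratio` is inferred from the wording "`prefetch_ratio` in `weight_prefetch_config`" and should be treated as an assumption, not as the confirmed schema.

```python
import json

MAX_PREFETCH_BYTES = 18 * 1024 * 1024  # L2-cache limit quoted in the guide


def prefetch_bytes(prefetch_ratio, weight_bytes):
    """Bytes prefetched for one linear layer, capped at 18 MB."""
    requested = int(prefetch_ratio * weight_bytes)
    return min(requested, MAX_PREFETCH_BYTES)


# Default ratios for MoE models as quoted in the guide; the nesting is assumed.
additional_config = {
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {
            "attn": {"qkv": 1.0, "o": 1.0},
            "moe": {"gate_up": 0.8},
        },
    }
}

if __name__ == "__main__":
    print("--additional-config", json.dumps(additional_config))
    # A 100 MiB weight at ratio 0.8 requests 80 MiB and is clamped to the cap.
    print(prefetch_bytes(0.8, 100 * 1024 * 1024))
```

Because of the cap, raising the ratio for a large layer has no effect once the requested size reaches 18 MB, so tuning mainly matters for smaller layers.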