diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/single_npu_multimodal.po b/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/single_npu_multimodal.po
index 0007af08..82ec5d65 100644
--- a/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/single_npu_multimodal.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/tutorials/single_npu_multimodal.po
@@ -70,13 +70,13 @@ msgstr "运行 docker 容器,在单个 NPU 上启动 vLLM 服务器:"
 #: ../../tutorials/single_npu_multimodal.md:154
 msgid ""
 "Add `--max_model_len` option to avoid ValueError that the "
-"Qwen2.5-VL-7B-Instruct model's max seq len (128000) is larger than the "
+"Qwen2.5-VL-7B-Instruct model's max_model_len (128000) is larger than the "
 "maximum number of tokens that can be stored in KV cache. This will differ "
 "with different NPU series base on the HBM size. Please modify the value "
 "according to a suitable value for your NPU series."
 msgstr ""
 "新增 `--max_model_len` 选项,以避免出现 ValueError,即 Qwen2.5-VL-7B-Instruct "
-"模型的最大序列长度(128000)大于 KV 缓存可存储的最大 token 数。该数值会根据不同 NPU 系列的 HBM 大小而不同。请根据你的 NPU"
+"模型的最大模型长度(128000)大于 KV 缓存可存储的最大 token 数。该数值会根据不同 NPU 系列的 HBM 大小而不同。请根据你的 NPU"
 " 系列,将该值设置为合适的数值。"
 
 #: ../../tutorials/single_npu_multimodal.md:157
diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po
index 8ec78056..11d5e9e4 100644
--- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po
+++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po
@@ -169,30 +169,10 @@ msgstr "Qwen2-VL"
 msgid "Qwen2.5-VL"
 msgstr "Qwen2.5-VL"
 
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "LLaVA 1.5"
-msgstr "LLaVA 1.5"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "LLaVA 1.6"
-msgstr "LLaVA 1.6"
-
 #: ../../user_guide/support_matrix/supported_models.md
 msgid "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)"
 msgstr "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)"
 
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "InternVL2"
-msgstr "InternVL2"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "InternVL2.5"
-msgstr "InternVL2.5"
-
-#: ../../user_guide/support_matrix/supported_models.md
-msgid "Qwen2-Audio"
-msgstr "Qwen2-Audio"
-
 #: ../../user_guide/support_matrix/supported_models.md
 msgid "LLaVA-Next"
 msgstr "LLaVA-Next"
diff --git a/docs/source/tutorials/Qwen-VL-Dense.md b/docs/source/tutorials/Qwen-VL-Dense.md
index 1093aeba..534568a3 100644
--- a/docs/source/tutorials/Qwen-VL-Dense.md
+++ b/docs/source/tutorials/Qwen-VL-Dense.md
@@ -414,7 +414,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
 ```
 
 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max seq len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add the `--max_model_len` option to avoid a ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in the KV cache. This limit differs across NPU series based on the HBM size. Please set the value to one suitable for your NPU series.
 :::
 
 If your service start successfully, you can see the info shown below:
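The `--max_model_len` note above maps directly onto vLLM's offline API as well. A minimal sketch, assuming the model fits on your NPU; the 32768-token cap and the prompt are illustrative choices, not recommendations:

```python
from vllm import LLM, SamplingParams

# Cap the context window below the model's 128000 default so the KV cache
# fits in HBM; 32768 is an illustrative value, not a recommendation.
llm = LLM(model="Qwen/Qwen2.5-VL-32B-Instruct", max_model_len=32768)

outputs = llm.generate(
    ["Briefly explain what the KV cache stores during decoding."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The same cap can equally be passed as `--max_model_len 32768` to `vllm serve`, as in the tutorial command above.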
diff --git a/docs/source/user_guide/configuration/additional_config.md b/docs/source/user_guide/configuration/additional_config.md
index 5163c102..659eb89a 100644
--- a/docs/source/user_guide/configuration/additional_config.md
+++ b/docs/source/user_guide/configuration/additional_config.md
@@ -33,12 +33,14 @@ The following table lists additional configuration options available in vLLM Asc
 | `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
 | `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, KV cache dtype needs to be set, currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
+| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. Restriction: can only be used when `tensor_parallel_size=1`. |
+| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
 | `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
 | `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
 | `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
 | `num_wait_worker_iterations` | int | `30` | The forward iterations when the EPLB worker will finish CPU tasks. In our test default value 30 can cover most cases. |
-| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
+| `expert_map_record_path` | str | `None` | Save the expert load calculation results to a new expert table in the specified directory. |
 | `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |
 | `dump_config` | str | `None` | Configuration file path for msprobe dump(eager mode). |
diff --git a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
index 90d8c88e..b62f3269 100644
--- a/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
+++ b/docs/source/user_guide/feature_guide/eplb_swift_balancer.md
@@ -76,7 +76,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
 - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).
 3. Model Compatibility:
-   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
+   - Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
    - Verify model architecture supports dynamic expert routing through --enable-expert-parallel.
 4. Gating Configuration:
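The options in the `additional_config` table above are passed to the vLLM Ascend plugin as a plain dict. A minimal sketch, assuming a recent vLLM build that exposes `additional_config` as an engine argument; the model name and the option values are illustrative only:

```python
from vllm import LLM

# Illustrative values only; see the table above for each option's meaning.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed MoE model with shared experts
    additional_config={
        "multistream_overlap_shared_expert": True,
        "dynamic_eplb": True,
        "num_iterations_eplb_update": 400,
    },
)
```

When serving online, the same dict is passed as a JSON string via the `--additional-config` flag on the `vllm serve` command line.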
diff --git a/docs/source/user_guide/feature_guide/kv_pool.md b/docs/source/user_guide/feature_guide/kv_pool.md
index 48b8e32b..c0b13fca 100644
--- a/docs/source/user_guide/feature_guide/kv_pool.md
+++ b/docs/source/user_guide/feature_guide/kv_pool.md
@@ -113,6 +113,7 @@ python3 -m vllm.entrypoints.openai.api_server \
     "kv_role": "kv_producer",
     "kv_port": "20001",
     "kv_connector_extra_config": {
+        "use_ascend_direct": true,
         "prefill": {
             "dp_size": 1,
             "tp_size": 1
diff --git a/docs/source/user_guide/feature_guide/lora.md b/docs/source/user_guide/feature_guide/lora.md
index a2322180..9ba08724 100644
--- a/docs/source/user_guide/feature_guide/lora.md
+++ b/docs/source/user_guide/feature_guide/lora.md
@@ -7,6 +7,11 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor
 
 You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode Guide](./graph_mode.md) for a better LoRA performance.
 
+Model download addresses:\
+Base model: https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files \
+LoRA model:
+https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files
+
 ## Example
 
 We provide a simple LoRA example here, which enables the ACLGraph mode by default.
diff --git a/docs/source/user_guide/feature_guide/quantization.md b/docs/source/user_guide/feature_guide/quantization.md
index 8632b74f..8212bb96 100644
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -6,13 +6,13 @@ Since version 0.9.0rc2, the quantization feature is experimentally supported by
 
 ## Install ModelSlim
 
-To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
+To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master), which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
 
 Install ModelSlim:
 
 ```bash
 # The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
-git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit
+git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit
 
 cd msit/msmodelslim
diff --git a/docs/source/user_guide/support_matrix/supported_features.md b/docs/source/user_guide/support_matrix/supported_features.md
index 72d8811e..5b8a1461 100644
--- a/docs/source/user_guide/support_matrix/supported_features.md
+++ b/docs/source/user_guide/support_matrix/supported_features.md
@@ -2,6 +2,8 @@
 
 The feature support principle of vLLM Ascend is: **aligned with the vLLM**. We are also actively collaborating with the community to accelerate support.
 
+Function calling (tool calling): https://docs.vllm.ai/en/latest/features/tool_calling/
+
 You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM Ascend:
 
 | Feature | Status | Next Step |
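The base and LoRA model downloads added to lora.md above can be exercised with vLLM's standard LoRA API. A minimal sketch, assuming both ModelScope downloads were placed in the current directory; the prompt and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Paths assume the two ModelScope downloads from the lora.md hunk above.
llm = LLM(model="./Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Write a SQL query that lists all users older than 30."],
    SamplingParams(temperature=0, max_tokens=128),
    # LoRARequest(adapter name, unique int id, path to adapter weights)
    lora_request=LoRARequest("sql_lora", 1, "./llama-2-7b-sql-lora-test"),
)
print(outputs[0].outputs[0].text)
```

Note that `enable_lora=True` must be set when the engine is constructed; the adapter itself is attached per request through `lora_request`.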