[doc][main] Correct mistakes in doc (#4945)
### What this PR does / why we need it?
Correct mistakes in doc
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -70,13 +70,13 @@ msgstr "运行 docker 容器,在单个 NPU 上启动 vLLM 服务器:"

 #: ../../tutorials/single_npu_multimodal.md:154
 msgid ""
 "Add `--max_model_len` option to avoid ValueError that the "
-"Qwen2.5-VL-7B-Instruct model's max seq len (128000) is larger than the "
+"Qwen2.5-VL-7B-Instruct model's max_model_len (128000) is larger than the "
 "maximum number of tokens that can be stored in KV cache. This will differ "
 "with different NPU series base on the HBM size. Please modify the value "
 "according to a suitable value for your NPU series."
 msgstr ""
 "新增 `--max_model_len` 选项,以避免出现 ValueError,即 Qwen2.5-VL-7B-Instruct "
-"模型的最大序列长度(128000)大于 KV 缓存可存储的最大 token 数。该数值会根据不同 NPU 系列的 HBM 大小而不同。请根据你的 NPU"
+"模型的最大模型长度(128000)大于 KV 缓存可存储的最大 token 数。该数值会根据不同 NPU 系列的 HBM 大小而不同。请根据你的 NPU"
 " 系列,将该值设置为合适的数值。"

 #: ../../tutorials/single_npu_multimodal.md:157
@@ -169,30 +169,10 @@ msgstr "Qwen2-VL"

 msgid "Qwen2.5-VL"
 msgstr "Qwen2.5-VL"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "LLaVA 1.5"
 msgstr "LLaVA 1.5"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "LLaVA 1.6"
 msgstr "LLaVA 1.6"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)"
 msgstr "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "InternVL2"
 msgstr "InternVL2"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "InternVL2.5"
 msgstr "InternVL2.5"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "Qwen2-Audio"
 msgstr "Qwen2-Audio"

 #: ../../user_guide/support_matrix/supported_models.md
 msgid "LLaVA-Next"
 msgstr "LLaVA-Next"
@@ -414,7 +414,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
 ```

 :::{note}
-Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max seq len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
+Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
 :::

 If your service start successfully, you can see the info shown below:
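The note above is about KV-cache capacity; the arithmetic behind that ValueError can be sketched as follows. All model numbers here (layer count, KV-head count, head dimension, dtype width) are illustrative assumptions for a Qwen2.5-class dense model, not values taken from this commit:

```python
def max_kv_tokens(free_hbm_bytes: int,
                  num_layers: int = 28,
                  num_kv_heads: int = 4,
                  head_dim: int = 128,
                  dtype_bytes: int = 2) -> int:
    """Rough upper bound on how many tokens fit in the KV cache.

    Per token the cache stores one key and one value vector per layer:
    2 * num_layers * num_kv_heads * head_dim * dtype_bytes bytes.
    """
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_hbm_bytes // per_token_bytes

# With only ~4 GiB of HBM left for the KV cache, 128000 tokens do not
# fit, which is the situation the note describes; a larger-HBM NPU has
# plenty of headroom. Lowering --max-model-len avoids the error.
print(max_kv_tokens(4 * 1024**3))
print(max_kv_tokens(48 * 1024**3))
```

This is why the suitable value differs across NPU series: the free-HBM term changes while the per-token cost stays fixed for a given model and dtype.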
@@ -33,12 +33,14 @@ The following table lists additional configuration options available in vLLM Asc
 | `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
 | `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, KV cache dtype needs to be set, currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
+| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. Restriction: Can only be used when tensor_parallel=1 |
+| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
 | `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
 | `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
 | `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
 | `num_wait_worker_iterations` | int | `30` | The forward iterations when the EPLB worker will finish CPU tasks. In our test default value 30 can cover most cases. |
-| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
+| `expert_map_record_path` | str | `None` | Save the expert load calculation results to a new expert table in the specified directory. |
 | `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |
 | `dump_config` | str | `None` | Configuration file path for msprobe dump(eager mode). |
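Options like those in the table above are passed to vLLM Ascend as a JSON string via `--additional-config`. A minimal sketch of building such a command line, assuming the flag names from the table; the model name is the one used elsewhere in this commit and the chosen values are illustrative:

```python
import json

# Flag names come from the additional-config table above;
# the values here are illustrative, not recommendations.
additional_config = {
    "dynamic_eplb": True,
    "num_iterations_eplb_update": 400,
    "num_wait_worker_iterations": 30,
    "gate_eplb": False,
}

cmd = [
    "vllm", "serve", "Qwen/Qwen3-235B-A22",
    "--enable-expert-parallel",
    "--additional-config", json.dumps(additional_config),
]
print(" ".join(cmd))
```

Serializing with `json.dumps` keeps the shell invocation a single well-formed JSON argument, which is easier to get right than hand-writing the quoting.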
@@ -76,7 +76,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
 - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).

 3. Model Compatibility:
-- Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
+- Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
 - Verify model architecture supports dynamic expert routing through --enable-expert-parallel.

 4. Gating Configuration:
@@ -113,6 +113,7 @@ python3 -m vllm.entrypoints.openai.api_server \
 "kv_role": "kv_producer",
 "kv_port": "20001",
 "kv_connector_extra_config": {
+"use_ascend_direct": true,
 "prefill": {
 "dp_size": 1,
 "tp_size": 1
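The JSON fragment above is truncated by the diff context, but its shape is visible. A minimal sketch of assembling such a kv-transfer config as a Python dict and round-tripping it through JSON (key names as shown in the snippet; values are illustrative):

```python
import json

# Key names mirror the truncated snippet above; values are placeholders.
kv_transfer_config = {
    "kv_role": "kv_producer",
    "kv_port": "20001",
    "kv_connector_extra_config": {
        "use_ascend_direct": True,
        "prefill": {"dp_size": 1, "tp_size": 1},
    },
}

# Round-trip through JSON, as it would be passed on the command line.
encoded = json.dumps(kv_transfer_config)
decoded = json.loads(encoded)
print(decoded["kv_role"])
```

Note that `kv_port` is a string in the snippet, not an integer; keeping the same types avoids surprises when the consumer parses the config.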
@@ -7,6 +7,11 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor

+You can run LoRA with ACLGraph mode now. Please refer to [Graph Mode Guide](./graph_mode.md) for a better LoRA performance.
+
+Address for downloading models:\
+base model: https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files \
+lora model: https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files

 ## Example
 We provide a simple LoRA example here, which enables the ACLGraph mode by default.
@@ -6,13 +6,13 @@ Since version 0.9.0rc2, the quantization feature is experimentally supported by

 ## Install ModelSlim

-To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
+To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

 Install ModelSlim:

 ```bash
 # The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
-git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit
+git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit/tree/master

 cd msit/msmodelslim
@@ -2,6 +2,8 @@

 The feature support principle of vLLM Ascend is: **aligned with the vLLM**. We are also actively collaborating with the community to accelerate support.

+Functional call: https://docs.vllm.ai/en/latest/features/tool_calling/

 You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM Ascend:

 | Feature | Status | Next Step |