[doc][main] Correct mistakes in doc (#4945)
### What this PR does / why we need it?
Correct mistakes in doc
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
@@ -33,12 +33,14 @@ The following table lists additional configuration options available in vLLM Ascend
 | `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
 | `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, the KV cache dtype needs to be set; currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
+| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. Restriction: can only be used when the tensor parallel size is 1. |
+| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
 | `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
 | `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
 | `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
 | `num_wait_worker_iterations` | int | `30` | The number of forward iterations within which the EPLB worker should finish its CPU tasks. In our tests, the default value of 30 covers most cases. |
-| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
+| `expert_map_record_path` | str | `None` | Save the expert load calculation results to a new expert table in the specified directory. |
 | `init_redundancy_expert` | int | `0` | Specify the number of redundant experts during initialization. |
 | `dump_config` | str | `None` | Configuration file path for msprobe dump (eager mode). |
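For context, these options are not standalone CLI flags; they are passed to vLLM as a single JSON object through the `--additional-config` argument. A minimal sketch, assuming that flag, an illustrative model name, and illustrative option values (none of which come from this commit):

```bash
# Hedged sketch: pass vLLM Ascend additional-config options as JSON.
# The model name and option values below are illustrative assumptions.
vllm serve Qwen/Qwen3-235B-A22 \
    --additional-config '{
        "multistream_overlap_shared_expert": true,
        "kv_cache_dtype": "int8"
    }'
```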
@@ -76,7 +76,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
 - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).

 3. Model Compatibility:
-   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
+   - Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
    - Verify that the model architecture supports dynamic expert routing through `--enable-expert-parallel`.

 4. Gating Configuration:
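As a concrete illustration of the constraints above, a hedged launch sketch for dynamic EPLB; the model name and values are illustrative, not part of this commit:

```bash
# Hedged sketch: dynamic EPLB on a Qwen3 MoE model, per the constraints above.
# Iteration counts mirror the table defaults; adjust for your workload.
vllm serve Qwen/Qwen3-235B-A22 \
    --enable-expert-parallel \
    --additional-config '{
        "dynamic_eplb": true,
        "num_iterations_eplb_update": 400,
        "num_wait_worker_iterations": 30
    }'
```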
@@ -113,6 +113,7 @@ python3 -m vllm.entrypoints.openai.api_server \
     "kv_role": "kv_producer",
     "kv_port": "20001",
     "kv_connector_extra_config": {
+        "use_ascend_direct": true,
         "prefill": {
             "dp_size": 1,
             "tp_size": 1
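The hunk cuts off mid-object; for readability, here is a hedged reconstruction of the complete flag this fragment belongs to. The connector name, model name, and surrounding command are assumptions, not part of this commit:

```bash
# Hedged reconstruction: the kv_connector_extra_config fragment above,
# completed with closing braces and wrapped in --kv-transfer-config.
# "LLMDataDistCMgrConnector" and the model name are assumptions.
python3 -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --kv-transfer-config '{
        "kv_connector": "LLMDataDistCMgrConnector",
        "kv_role": "kv_producer",
        "kv_port": "20001",
        "kv_connector_extra_config": {
            "use_ascend_direct": true,
            "prefill": {
                "dp_size": 1,
                "tp_size": 1
            }
        }
    }'
```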
@@ -7,6 +7,11 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor

+You can run LoRA with ACLGraph mode now. Please refer to the [Graph Mode Guide](./graph_mode.md) for better LoRA performance.
+
+Addresses for downloading the models:\
+base model: https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files \
+lora model: https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files

 ## Example

 We provide a simple LoRA example here, which enables ACLGraph mode by default.
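A minimal sketch of such an example, assuming vLLM's standard LoRA flags and the ModelScope sources linked above; the adapter name `sql-lora` is illustrative:

```bash
# Hedged sketch: serve the base model with the SQL LoRA adapter attached.
# VLLM_USE_MODELSCOPE=1 pulls both repos from ModelScope instead of HF.
VLLM_USE_MODELSCOPE=1 vllm serve vllm-ascend/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=vllm-ascend/llama-2-7b-sql-lora-test
```

Requests can then target the adapter by setting `"model": "sql-lora"` in the OpenAI-compatible API.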
@@ -6,13 +6,13 @@ Since version 0.9.0rc2, the quantization feature is experimentally supported by

 ## Install ModelSlim

-To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
+To quantize a model, you should install [ModelSlim](https://gitcode.com/Ascend/msit/tree/master), which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

 Install ModelSlim:

 ```bash
 # The branch (br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
-git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit
+git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitcode.com/Ascend/msit

 cd msit/msmodelslim
 ```
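To round out the snippet, a hedged sketch of finishing and verifying the install; the `install.sh` script name is an assumption based on common ModelSlim setups, not part of this commit:

```bash
# Hedged continuation: run the ModelSlim installer and confirm the package.
cd msit/msmodelslim
bash install.sh
pip show msmodelslim   # should print the installed version
```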
@@ -2,6 +2,8 @@

 The feature support principle of vLLM Ascend is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.

+Function calling: https://docs.vllm.ai/en/latest/features/tool_calling/
+
 You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM Ascend:

 | Feature | Status | Next Step |