[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions

View File

@@ -14,7 +14,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
- Contributor:
**Responsibility:** Help new contributors on boarding, handle and respond to community questions, review RFCs and code.
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code.
**Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in a project, including but not limited to issue/review/commits/community involvement.
@@ -37,7 +37,7 @@ Maintainers will be granted write access to the [vllm-project/vllm-ascend](https
### The Principles
- Membership in vLLM Ascend is given to individuals on merit basis after they demonstrate their strong expertise in vLLM/vLLM Ascend through contributions, reviews, and discussions.
- Membership in vLLM Ascend is given to individuals on a merit basis after they demonstrate their strong expertise in vLLM/vLLM Ascend through contributions, reviews, and discussions.
- For membership in the maintainer group, individuals have to demonstrate strong and continued alignment with the overall vLLM/vLLM Ascend principles.

View File

@@ -10,7 +10,7 @@ Read case studies on how users and developers solve real, everyday problems with
- [GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU cluster manager for running AI models. It supports vLLM Ascend since [v0.6.2](https://github.com/gpustack/gpustack/releases/tag/v0.6.2). See more GPUStack performance evaluation information at [this link](https://mp.weixin.qq.com/s/pkytJVjcH9_OnffnsFGaew).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient, and production-ready RL training library for LLMs. It uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). See more information on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html).
- [verl](https://github.com/volcengine/verl) is a flexible, efficient, and production-ready RL training library for LLMs. It uses vLLM Ascend since [v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0). See more information on [verl x Ascend Quickstart](https://verl.readthedocs.io/en/latest/ascend_tutorial/quick_start/ascend_quick_start.html).
:::{toctree}
:caption: More details

View File

@@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate and inference the model after fine-tuning.
LLaMA-Factory users need to evaluate the model and perform inference after fine-tuning.
**Business challenge**

View File

@@ -174,4 +174,4 @@ Notes:
- `torch-npu`: Ascend Extension for PyTorch (torch-npu) releases a stable version to [PyPi](https://pypi.org/project/torch-npu)
every 3 months, a development version (aka the POC version) every month, and a nightly version every day.
The PyPi stable version **CAN** be used in vLLM Ascend final version, the monthly dev version **ONLY CAN** be used in
vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version and branch.
vLLM Ascend RC version for rapid iteration, and the nightly version **CANNOT** be used in vLLM Ascend any version or branch.

View File

@@ -25,7 +25,7 @@ device: | run op1 | run op2 | run op3 | run op4 | run op5 |
## How to use ACL Graph?
ACL Graph is enabled by default in V1 Engine, just need to check that `enforce_eager` is not set to `True`. More details see: [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html)
ACL Graph is enabled by default in the V1 Engine; you just need to check that `enforce_eager` is not set to `True`. For more details, see the [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html)
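As a minimal sketch (the model name is illustrative), nothing extra is needed to turn ACL Graph on; it is only turned off when eager mode is forced:
```python
from vllm import LLM

# ACL Graph is used by default in the V1 Engine.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Passing enforce_eager=True would disable ACL Graph and fall back to eager execution:
# llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enforce_eager=True)
```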
## How it works?

View File

@@ -217,19 +217,19 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
Current **Token IDs table**:
```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```
**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
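As a sketch of how these numbers line up (assuming the flattened Token IDs table rows above are 12 entries wide), the `token indices` are just `position + request_row * row_width`:
```python
import numpy as np

ROW_WIDTH = 12  # width of each Token IDs table row in the example above

req_indices = np.array([0, 1, 2, 2, 2])  # request indices
positions = np.array([3, 2, 5, 6, 7])    # token positions
token_indices = positions + req_indices * ROW_WIDTH
print(token_indices.tolist())  # [3, 14, 29, 30, 31]
```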
@@ -281,6 +281,6 @@ Scheduled token count: `[1, 1, 3]`
## At last
If you understand step_1 and step_2, you will know all the following steps.
If you understand step 1 and step 2, you will know all the following steps.
Hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute to us.

View File

@@ -16,10 +16,10 @@ enable_custom_op()
## How to add a custom aclnn operation?
- Create a new operation folder under `csrc` directory
- Create `op_host` and `op_kernel` directories for host and kernel source code
- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for op being captured into aclgraph
- Create a new operation folder under `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
After a successful build of vllm-ascend, the custom aclnn operation can be invoked in Python code, for example:
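The snippet below is only a sketch: `my_custom_op` and its signature are placeholders for whatever operator you bound in `csrc/torch_binding.cpp`, and the import path for `enable_custom_op` is an assumption.
```python
import torch
import torch_npu  # noqa: F401  # provides the NPU backend for torch
from vllm_ascend.utils import enable_custom_op  # import path may differ

# Make the custom ops under torch.ops._C_ascend available.
enable_custom_op()

# Hypothetical operator and signature; replace with your own op.
x = torch.randn(4, 128, device="npu")
out = torch.ops._C_ascend.my_custom_op(x)
```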

View File

@@ -201,7 +201,7 @@ To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has
- Logs show the selected binding mode and the allocation plan, for example:
- `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
- `The CPU allocation plan is as follows: ...`
- You can verify affinity via taskset or /proc/<pid>/status after startup.
- You can verify affinity via taskset or `/proc/<pid>/status` after startup.
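For example, a minimal Python check of the same kernel field that `taskset` reports:
```python
import os

# Print the CPU affinity the kernel recorded for this process.
with open(f"/proc/{os.getpid()}/status") as status:
    for line in status:
        if line.startswith("Cpus_allowed_list"):
            print(line.strip())
```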
## Limitations & Notes

View File

@@ -20,7 +20,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
---

View File

@@ -100,28 +100,28 @@ If you want to add a new eplb policy to vllm_ascend, you must follow these steps
1. Inherit the `EplbPolicy` abstract class of `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return type (`newplacement`) consistent.
For example:
For example:
```python
class RandomLoadBalance(EplbPolicy):
    def __init__(self, config: DynamicConfig):
        super().__init__(config)

    def rebalance_experts(self, current_expert_table, expert_workload):
        new_table = copy.deepcopy(current_expert_table)
        num_layers = len(current_expert_table)

        for i in range(num_layers):
            # randomly choose two cards
            # indices = random.sample(range(num_card), 2)
            indices = [3, 1]

            # swap redundant experts
            expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
            new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
            new_table[i][indices[1]][-1] = expert_id_to_exchange

        return 1, [-i for i in range(num_layers)], new_table
```
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
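A rough sketch of step 2, assuming a simple dict-based factory (the real layout of `policy_factory.py` may differ):
```python
from .policy_abstract import DynamicConfig, EplbPolicy
from .random_load_balance import RandomLoadBalance  # hypothetical module for the policy above

class PolicyFactory:
    @staticmethod
    def generate_policy(policy_type: int, config: DynamicConfig) -> EplbPolicy:
        # Map policy-type IDs to implementation classes; register new
        # policies by extending this dict.
        policies = {
            0: RandomLoadBalance,
        }
        return policies[policy_type](config)
```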

View File

@@ -1,9 +1,9 @@
# Feature Guide
# Design Documents
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Feature Guide
:caption: Design Documents
:maxdepth: 1
patch
cpu_binding

View File

@@ -1,101 +1,101 @@
# Npugraph_ex
## How Does It Work?
This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode.
You can get its [code](https://gitcode.com/Ascend/torchair)
## Default Fx Graph Optimization
### Fx Graph pass
- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement.
### Fx fusion pass
npugraph_ex now provides three default operator fusion passes, and more will be added in the future.
Operator combinations that meet the replacement rules can be replaced with the corresponding fused operators.
You can get the default [fusion pass list](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html)
## Custom fusion pass
Users can register a custom graph fusion pass in TorchAir to modify PyTorch FX graphs. The registration relies on the register_replacement API.
Below is the declaration of this API and a demo of its usage.
```python
register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)
```
|Parameter Name| Input/Output |Explanation|Required|
|--|--------------|---|-------|
|search_fn|Input|This function is the operator combination or calculation logic that you want to recognize in the FX graph, such as the operator combination that needs to be fused|Yes|
|replace_fn|Input|When the combination corresponding to search_fn is found in the target graph, this function's computation logic will replace the original subgraph to achieve operator fusion or optimization.|Yes|
|example_inputs|Input|Example input tensors used to track search_fn and replace_fn. The shape and dtype of the input should match the actual scenario.|Yes|
|trace_fn|Input|By default, only the forward computation graph is tracked, which is suitable for optimization during the inference phase; if training scenarios need to be supported, a function that supports backward tracking can be provided.|No|
|extra_check|Input|An additional verification function applied after a pattern match is found. Its input parameter must be a Match object from torch._inductor.pattern_matcher, and it is used for further custom checks on the matching result, such as checking whether the fused operators are on the same stream, checking the device type, checking the input shapes, and so on.|No|
|search_fn_pattern|Input|A custom pattern object is generally unnecessary to provide. Its definition follows the rules of the native PyTorch MultiOutputPattern object. After passing this parameter, search_fn will no longer be used to match operator combinations; instead, this parameter will be used directly as the matching rule.|No|
Usage Example
```python
import functools
import torch, torch_npu, torchair
from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode
from torchair.core.utils import logger

# Assume fusing the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator.
# Define a search_fn to find the operator combinations in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, that is, a fusion operator, used to replace operator combinations in the FX graph.
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(
        x1, x2, gamma
    )
    return y, xOut

# extra_check can pass in additional validation logic. Here, it is used to check whether the last
# dimension of the first input parameter x1 is a specific value; if it is not the specific value,
# fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")
    if x1 is None:
        return False
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False
    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168

# Define some sample inputs to trace search_fn and replace_fn into an FX graph.
fake_mode = FakeTensorMode()
with fake_mode:
    # sizes/values don't actually matter for the initial trace;
    # once we get a possible match we re-trace with the actual values and verify the match still holds
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

    # Call the torchair.register_replacement API with search_fn, replace_fn, and example_inputs.
    # If there are additional validations, you can pass them in as extra_check.
    torchair.register_replacement(
        search_fn=search_fn,
        replace_fn=replace_fn,
        example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
        extra_check=extra_check
    )
```
The default fusion pass in npugraph_ex is also implemented based on this API. You can see more examples of using this API in the vllm-ascend and npugraph_ex code repositories.
### DFX
npugraph_ex reuses the TORCH_COMPILE_DEBUG environment variable from the PyTorch community: when TORCH_COMPILE_DEBUG=1 is set, it outputs the FX graphs throughout the entire process.

View File

@@ -72,5 +72,5 @@ Before writing a patch, following the principle above, we should patch the least
## Limitations
1. In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore process and Worker process. Now vLLM Ascend only can patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Now vLLM Ascend can only patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run the edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.10.0 should work.
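For orientation, the patch mechanism itself is ordinary monkey-patching; a minimal sketch (the target symbol and the replacement body are illustrative only):
```python
# vllm_ascend/patch/platform/patch_common/patch_distributed.py (illustrative path)
import vllm.distributed

_original_destroy = vllm.distributed.destroy_model_parallel

def ascend_destroy_model_parallel():
    # Ascend-specific cleanup would go here, before or after the original call.
    _original_destroy()

# Replace the vLLM symbol with the Ascend-aware version at import time.
vllm.distributed.destroy_model_parallel = ascend_destroy_model_parallel
```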

View File

@@ -42,7 +42,7 @@ We need to implement the `create_weights`, `process_weights_after_loading`, and
"model.layers.0.mlp.gate.weight": "FLOAT",
"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```

View File

@@ -41,7 +41,7 @@ After completing "Run lint" setup, you can run CI locally:
cd ~/vllm-project/
# Run CI need vLLM installed
# Run CI needs vLLM installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
@@ -52,7 +52,7 @@ cd ..
cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non Linux:
# For non-Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
@@ -102,7 +102,7 @@ If the PR spans more than one category, please include all relevant prefixes.
## Others
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing).
If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.
:::{toctree}

View File

@@ -20,7 +20,7 @@ From the workflow perspective, we can see how the final test script is executed,
2. Add config yaml
As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106) shows, A k8s pod startup means traversing all *.yaml files in the [directory](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/), reading and executing according to different configurations, so what we need to do is just add "yamls" like [DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml).
As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106) shows, a k8s pod startup means traversing all *.yaml files in the [directory](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/), reading and executing according to different configurations, so what we need to do is just add "yamls" like [DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml).
Suppose you have **2 nodes** running a 1P1D setup (1 Prefillers + 1 Decoder):
@@ -72,7 +72,7 @@ From the workflow perspective, we can see how the final test script is executed,
3. Add the case to nightly workflow
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/nightly_test_a3.yaml)
Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml)
```yaml
multi-node-tests:
@@ -113,7 +113,7 @@ The matrix above defines all the parameters required to add a multi-machine use
### 1. Use kubernetes
This section assumes that you already have a [Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment locally. then you can easily start our test with one click.
This section assumes that you already have a [Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment locally. Then you can easily start our test with one click.
- Step 1. Install LWS CRD resources

View File

@@ -153,8 +153,8 @@ pip install -r requirements-dev.txt
There are several principles to follow when writing unit tests:
- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker.py` --> `tests/ut/worker/test_worker.py`
- The vLLM Ascend test uses unittest framework. See [here](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock the device-related function to host.
- The vLLM Ascend test uses unittest framework. See [the Python unittest documentation](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock the device-related functions on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
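To illustrate the CPU-only mocking requirement above, a minimal sketch with `unittest.mock` (the patched target is just an example):
```python
import unittest
from unittest import mock

class TestSomethingOnCpu(unittest.TestCase):
    # Replace a device-dependent helper so the test can run without an NPU.
    @mock.patch("vllm_ascend.utils.is_310p", return_value=False)
    def test_runs_on_cpu(self, _mock_is_310p):
        from vllm_ascend import utils
        self.assertFalse(utils.is_310p())
```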
@@ -205,8 +205,8 @@ pytest -sv tests/ut/test_ascend_config.py
### E2E test
Although vllm-ascend CI provides the [E2E test](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test.yaml) on Ascend CI, you can run it
locally.
Although vllm-ascend CI provides E2E tests on Ascend CI (for example,
[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test_full.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can run them locally.
:::::{tab-set}
:sync-group: e2e

View File

@@ -81,74 +81,74 @@ You can choose one or multiple datasets to execute accuracy evaluation.
1. `C-Eval` dataset.
Take `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
Download the dataset and install it to a specific path.
```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
2. `MMLU` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
unzip mmlu.zip
rm mmlu.zip
```
3. `GPQA` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
unzip gpqa.zip
rm gpqa.zip
```
4. `MATH` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
unzip math.zip
rm math.zip
```
5. `LiveCodeBench` dataset.
```shell
cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
```
6. `AIME 2024` dataset.
```shell
cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip
```
7. `GSM8K` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip
```
#### Configuration

View File

@@ -87,7 +87,7 @@ After 1 to 2 minutes, the output is shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more detail in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
See more details in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope

View File

@@ -214,7 +214,7 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
## Use Offline Datasets
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [here][2].
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [using-local-datasets][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets

View File

@@ -4,7 +4,7 @@ This document guides you to conduct accuracy testing using [OpenCompass](https:/
## 1. Online Server
You can run docker container to start the vLLM server on a single NPU:
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
@@ -116,7 +116,7 @@ python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
After 1 to 2 minutes, the output is shown below:
```shell
The markdown format results is as below:
The markdown format results are as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|

View File

@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
#### 2.2. Tcmalloc
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
@@ -178,10 +178,10 @@ export HCCL_OP_EXPANSION_MODE="AIV"
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
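These knobs are ordinary environment variables that must be set before the engine (and hence HCCL) initializes; for example, a sketch in Python (the value is illustrative, not a recommendation):
```python
import os

# Control the cache size for sharing data between two NPUs (see the HCCL docs above).
os.environ["HCCL_BUFFSIZE"] = "512"
```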
### 5. OS Optimization

View File

@@ -4,6 +4,12 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
**Legend Description**:
- ✅ = Supported
- 🟡 = Partial / Work in progress
- 🚧 = Under development
## 1. Run docker container
```{code-block} bash

View File

@@ -108,17 +108,13 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 8. Does vllm-ascend support Prefill Disaggregation feature?
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. See the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html) for example.
Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. See the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html) for example.
### 9. Does vllm-ascend support quantization method?
Currently, w8a8, w4a8, and w4a4 quantization methods are already supported by vllm-ascend.
### 10. How to run a W8A8 DeepSeek model?
Follow the [inference tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node.html) and replace the model with DeepSeek.
### 11. How is vllm-ascend tested?
### 10. How is vllm-ascend tested?
vllm-ascend is tested in three aspects: functions, performance, and accuracy.
@@ -132,23 +128,23 @@ vllm-ascend is tested in three aspects: functions, performance, and accuracy.
For each release, we'll publish the performance test and accuracy test report in the future.
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
### 11. How to fix the error "InvalidVersion" when using vllm-ascend?
The problem is usually caused by the installation of a development or editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the environment variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
### 13. How to handle the out-of-memory issue?
### 12. How to handle the out-of-memory issue?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory).
In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- **Limit `--max-model-len`**: It can save the HBM usage for KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
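Taken together, a memory-constrained launch might look like the following sketch (model and values are illustrative):
```python
import os

# Enable the virtual memory feature before any NPU allocation happens.
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model
    max_model_len=4096,                  # cap the KV cache footprint
    gpu_memory_utilization=0.8,          # reserve headroom below the 0.9 default
)
```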
### 14. Failed to enable NPU graph mode when running DeepSeek
### 13. Failed to enable NPU graph mode when running DeepSeek
Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV—a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
@@ -159,54 +155,54 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
### 14. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend
You may encounter the problem of C/C++ compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, use `python setup.py install` (recommended) to install, or use `python setup.py clean` to clear the cache.
### 16. How to generate deterministic results when using vllm-ascend?
### 15. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output determinism:
1. Sampler method: using **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
2. Set the following environment parameters:
```bash
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=true
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
```
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
### 16. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model
The `Qwen2.5-Omni` model requires the `librosa` package to be installed. You need to install the `qwen-omni-utils` package to ensure all dependencies are met: run `pip install qwen-omni-utils`.
This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring that the audio processing functionality works correctly.
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
### 17. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
```shell
error example in detail:
@@ -222,11 +218,11 @@ Recommended mitigation strategies:
Root cause analysis:
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream-overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements, such as operator characteristics and specific hardware features, consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 19. How to install custom version of torch_npu?
### 18. How to install custom version of torch_npu?
torch-npu will be overridden when installing vllm-ascend. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after vllm-ascend is installed.
### 20. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
### 19. On certain systems (e.g., Kylin OS), `docker pull` may fail with an `invalid tar header` error
On certain operating systems, such as Kylin OS, you may encounter an `invalid tar header` error during the `docker pull` process:
@@ -253,11 +249,11 @@ This is often due to system compatibility issues. You can resolve this by using
Copy the `vllm_ascend_<tag>.tar` file (where `<tag>` is the image tag you used) to your target machine
### 21. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted"
### 20. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted"
When using `--shm-size`, you may need to add the `--privileged=true` flag to your `docker run` command to grant the container necessary permissions. Please be aware that using `--privileged=true` grants the container extensive privileges on the host system, which can be a security risk. Only use this option if you understand the implications and trust the container's source.
### 22. How to achieve low latency in a small batch scenario?
### 21. How to achieve low latency in a small batch scenario?
The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenarios is not satisfactory, mainly due to the lack of a flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`; you can install it using the following instruction:
@@ -270,7 +266,7 @@ bash tools/install_flash_infer_attention_score_ops_a2.sh
**NOTE**: Don't set `additional_config.pa_shape_list` when using this method; otherwise, it will lead to another attention operator.
**Important**: Please make sure you're using the **official image** of `vllm-ascend`; otherwise, you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own, or create one. If you're not the root user, you need `sudo` **privileges** to run this script.
### 23. How to set `SOC_VERSION` when building from source on a CPU-only machine?
### 22. How to set `SOC_VERSION` when building from source on a CPU-only machine?
When building from source (e.g. `pip install -e .`), the build may try to infer the target chip via `npu-smi`. If `npu-smi` is not available (common in CPU-only build environments), you must set `SOC_VERSION` manually before installation.

View File

@@ -57,7 +57,7 @@ user_guide/release_notes
:caption: Developer Guide
:maxdepth: 1
developer_guide/contribution/index
developer_guide/feature_guide/index
developer_guide/Design_Documents/index
developer_guide/evaluation/index
developer_guide/performance_and_debug/index
:::

View File

@@ -11,7 +11,7 @@ This document describes how to install vllm-ascend manually.
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| Ascend HDK | Refer to the documentation [here](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
| Ascend HDK | Refer to the documentation [CANN 8.3.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/releasenote/releasenote_0000.html) | Required for CANN |
| CANN | == 8.5.1 | Required for vllm-ascend and torch-npu |
| torch-npu | == 2.9.0 | Required for vllm-ascend. No need to install manually; it will be auto-installed in the steps below |
| torch | == 2.9.0 | Required for torch-npu and vllm |

View File

@@ -36,7 +36,7 @@ Documentation: https://docs.vllm.ai/projects/ascend/en/latest/index.html
## Developer Guide
- [Contribution Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html): setup and contribution workflow.
- [Developer Feature Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/index.html): developer-side feature docs.
- [Developer Design Documents](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/index.html): developer-side feature docs.
- [Evaluation Guide](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/index.html): evaluation workflow.
- [Performance and Debug Index](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/index.html): profiling and debugging docs.

View File

@@ -20,12 +20,12 @@ msgstr ""
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../developer_guide/feature_guide/index.md:1
#: ../../developer_guide/feature_guide/index.md:5
msgid "Feature Guide"
msgstr "功能指南"
#: ../../developer_guide/Design_Documents/index.md:1
#: ../../developer_guide/Design_Documents/index.md:5
msgid "Design Documents"
msgstr "设计文档"
#: ../../developer_guide/feature_guide/index.md:3
#: ../../developer_guide/Design_Documents/index.md:3
msgid ""
"This section provides an overview of the features implemented in vLLM "
"Ascend. Developers can refer to this guide to understand how vLLM Ascend "

View File

@@ -20,11 +20,11 @@ msgstr ""
"Plural-Forms: nplurals=1; plural=0;\n"
"Generated-By: Babel 2.17.0\n"
#: ../../developer_guide/feature_guide/patch.md:1
#: ../../developer_guide/Design_Documents/patch.md:1
msgid "Patch in vLLM Ascend"
msgstr "在 vLLM Ascend 中的补丁"
#: ../../developer_guide/feature_guide/patch.md:3
#: ../../developer_guide/Design_Documents/patch.md:3
msgid ""
"vLLM Ascend is a platform plugin for vLLM. Due to the release cycle of vLLM "
"and vLLM Ascend is different, and the hardware limitation in some case, we "
@@ -33,17 +33,17 @@ msgstr ""
"vLLM Ascend 是 vLLM 的一个平台插件。由于 vLLM 和 vLLM Ascend "
"的发布周期不同,并且在某些情况下存在硬件限制,我们需要对 vLLM 进行一些代码补丁,以使其能够兼容 vLLM Ascend。"
#: ../../developer_guide/feature_guide/patch.md:5
#: ../../developer_guide/Design_Documents/patch.md:5
msgid ""
"In vLLM Ascend code, we provide a patch module `vllm_ascend/patch` to "
"address the change for vLLM."
msgstr "在 vLLM Ascend 代码中,我们提供了一个补丁模块 `vllm_ascend/patch` 用于应对 vLLM 的变更。"
#: ../../developer_guide/feature_guide/patch.md:7
#: ../../developer_guide/Design_Documents/patch.md:7
msgid "Principle"
msgstr "原理"
#: ../../developer_guide/feature_guide/patch.md:9
#: ../../developer_guide/Design_Documents/patch.md:9
msgid ""
"We should keep in mind that Patch is not the best way to make vLLM Ascend "
"compatible. It's just a temporary solution. The best way is to contribute "
@@ -53,29 +53,29 @@ msgstr ""
"我们需要记住Patch 不是让 vLLM 兼容 Ascend 的最佳方式,这只是一个临时的解决方案。最好的方法是将修改贡献到 vLLM 项目中,从而让"
" vLLM 原生支持 Ascend。对于 vLLM Ascend我们对 Patch 策略有一个基本原则:"
#: ../../developer_guide/feature_guide/patch.md:11
#: ../../developer_guide/Design_Documents/patch.md:11
msgid "Less is more. Please do not patch unless it's the only way currently."
msgstr "少即是多。请不要打补丁,除非这是目前唯一的方法。"
#: ../../developer_guide/feature_guide/patch.md:12
#: ../../developer_guide/Design_Documents/patch.md:12
msgid ""
"Once a patch is added, it's required to describe the future plan for "
"removing the patch."
msgstr "一旦补丁被添加,必须说明将来移除该补丁的计划。"
#: ../../developer_guide/feature_guide/patch.md:13
#: ../../developer_guide/Design_Documents/patch.md:13
msgid "Anytime, clean the patch code is welcome."
msgstr "任何时候,欢迎清理补丁代码。"
#: ../../developer_guide/feature_guide/patch.md:15
#: ../../developer_guide/Design_Documents/patch.md:15
msgid "How it works"
msgstr "工作原理"
#: ../../developer_guide/feature_guide/patch.md:17
#: ../../developer_guide/Design_Documents/patch.md:17
msgid "In `vllm_ascend/patch`, you can see the code structure as follows:"
msgstr "在 `vllm_ascend/patch` 目录中,你可以看到如下代码结构:"
#: ../../developer_guide/feature_guide/patch.md:33
#: ../../developer_guide/Design_Documents/patch.md:33
msgid ""
"**platform**: The patch code in this directory is for patching the code in "
"vLLM main process. It's called by "
@@ -85,7 +85,7 @@ msgstr ""
"**platform**:此目录下的补丁代码用于修补 vLLM 主进程中的代码。当 vLLM 初始化时,会在很早的阶段由 "
"`vllm_ascend/platform::NPUPlatform::pre_register_and_update` 调用。"
#: ../../developer_guide/feature_guide/patch.md:34
#: ../../developer_guide/Design_Documents/patch.md:34
msgid ""
"For online mode, vLLM process calls the platform patch here "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` when parsing "
@@ -94,7 +94,7 @@ msgstr ""
"对于在线模式vLLM 进程在解析命令行参数时,会在 "
"`vllm/vllm/engine/arg_utils.py::AsyncEngineArgs.add_cli_args` 这里调用平台补丁。"
#: ../../developer_guide/feature_guide/patch.md:35
#: ../../developer_guide/Design_Documents/patch.md:35
msgid ""
"For offline mode, vLLM process calls the platform patch here "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config` when "
@@ -103,7 +103,7 @@ msgstr ""
"对于离线模式vLLM 进程在解析输入参数时,会在此处调用平台补丁 "
"`vllm/vllm/engine/arg_utils.py::EngineArgs.create_engine_config`。"
#: ../../developer_guide/feature_guide/patch.md:36
#: ../../developer_guide/Design_Documents/patch.md:36
msgid ""
"**worker**: The patch code in this directory is for patching the code in "
"vLLM worker process. It's called by "
@@ -113,7 +113,7 @@ msgstr ""
"**worker**:此目录中的补丁代码用于修补 vLLM worker 进程中的代码。在初始化 vLLM worker 进程时,会被 "
"`vllm_ascend/worker/worker::NPUWorker::__init__` 调用。"
#: ../../developer_guide/feature_guide/patch.md:37
#: ../../developer_guide/Design_Documents/patch.md:37
msgid ""
"For both online and offline mode, vLLM engine core process calls the worker "
"patch here `vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker` "
@@ -122,13 +122,13 @@ msgstr ""
"无论是在线还是离线模式vLLM 引擎核心进程在初始化 worker 进程时,都会在这里调用 worker "
"补丁:`vllm/vllm/worker/worker_base.py::WorkerWrapperBase.init_worker`。"
#: ../../developer_guide/feature_guide/patch.md:39
#: ../../developer_guide/Design_Documents/patch.md:39
msgid ""
"In both **platform** and **worker** folder, there are several patch modules."
" They are used for patching different version of vLLM."
msgstr "在 **platform** 和 **worker** 文件夹中都有一些补丁模块。它们用于修补不同版本的 vLLM。"
#: ../../developer_guide/feature_guide/patch.md:41
#: ../../developer_guide/Design_Documents/patch.md:41
msgid ""
"`patch_0_9_2`: This module is used for patching vLLM 0.9.2. The version is "
"always the nearest version of vLLM. Once vLLM is released, we will drop this"
@@ -138,22 +138,22 @@ msgstr ""
"`patch_0_9_2`:此模块用于修补 vLLM 0.9.2。该版本始终对应于 vLLM 的最近版本。一旦 vLLM "
"发布新版本,我们将移除此补丁模块并升级到新版本。例如,`patch_0_9_2` 就是用于修补 vLLM 0.9.2 的。"
#: ../../developer_guide/feature_guide/patch.md:42
#: ../../developer_guide/Design_Documents/patch.md:42
msgid ""
"`patch_main`: This module is used for patching the code in vLLM main branch."
msgstr "`patch_main`:该模块用于修补 vLLM 主分支代码。"
#: ../../developer_guide/feature_guide/patch.md:43
#: ../../developer_guide/Design_Documents/patch.md:43
msgid ""
"`patch_common`: This module is used for patching both vLLM 0.9.2 and vLLM "
"main branch."
msgstr "`patch_common`:此模块用于同时修补 vLLM 0.9.2 版本和 vLLM 主分支。"
#: ../../developer_guide/feature_guide/patch.md:45
#: ../../developer_guide/Design_Documents/patch.md:45
msgid "How to write a patch"
msgstr "如何撰写补丁"
#: ../../developer_guide/feature_guide/patch.md:47
#: ../../developer_guide/Design_Documents/patch.md:47
msgid ""
"Before writing a patch, following the principle above, we should patch the "
"least code. If it's necessary, we can patch the code in either **platform** "
@@ -163,19 +163,19 @@ msgstr ""
"在编写补丁之前,遵循上述原则,我们应尽量修改最少的代码。如果有必要,我们可以修改 **platform** 和 **worker** "
"文件夹中的代码。下面是一个在 vLLM 中修改 `distributed` 模块的示例。"
#: ../../developer_guide/feature_guide/patch.md:49
#: ../../developer_guide/Design_Documents/patch.md:49
msgid ""
"Decide which version of vLLM we should patch. For example, after analysis, "
"here we want to patch both 0.9.2 and main of vLLM."
msgstr "决定我们应该修补哪个版本的 vLLM。例如经过分析后这里我们想要同时修补 vLLM 的 0.9.2 版和主分支main。"
#: ../../developer_guide/feature_guide/patch.md:50
#: ../../developer_guide/Design_Documents/patch.md:50
msgid ""
"Decide which process we should patch. For example, here `distributed` "
"belongs to the vLLM main process, so we should patch `platform`."
msgstr "决定我们应该修补哪个进程。例如,这里 `distributed` 属于 vLLM 主进程,所以我们应该修补 `platform`。"
#: ../../developer_guide/feature_guide/patch.md:51
#: ../../developer_guide/Design_Documents/patch.md:51
#, python-brace-format
msgid ""
"Create the patch file in the right folder. The file should be named as "
@@ -185,11 +185,11 @@ msgstr ""
"在正确的文件夹中创建补丁文件。文件应命名为 `patch_{module_name}.py`。此处的示例是 "
"`vllm_ascend/patch/platform/patch_common/patch_distributed.py`。"
#: ../../developer_guide/feature_guide/patch.md:52
#: ../../developer_guide/Design_Documents/patch.md:52
msgid "Write your patch code in the new file. Here is an example:"
msgstr "在新文件中编写你的补丁代码。以下是一个示例:"
#: ../../developer_guide/feature_guide/patch.md:62
#: ../../developer_guide/Design_Documents/patch.md:62
msgid ""
"Import the patch file in `__init__.py`. In this example, add `import "
"vllm_ascend.patch.platform.patch_common.patch_distributed` into "
@@ -199,13 +199,13 @@ msgstr ""
"vllm_ascend.patch.platform.patch_common.patch_distributed` 添加到 "
"`vllm_ascend/patch/platform/patch_common/__init__.py` 中。"
#: ../../developer_guide/feature_guide/patch.md:63
#: ../../developer_guide/Design_Documents/patch.md:63
msgid ""
"Add the description of the patch in `vllm_ascend/patch/__init__.py`. The "
"description format is as follows:"
msgstr "在 `vllm_ascend/patch/__init__.py` 中添加补丁的描述。描述格式如下:"
#: ../../developer_guide/feature_guide/patch.md:77
#: ../../developer_guide/Design_Documents/patch.md:77
msgid ""
"Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should "
"contain the Unit Test and E2E Test as well. You can find more details in "
@@ -214,11 +214,11 @@ msgstr ""
"添加单元测试和端到端E2E测试。在 vLLM Ascend 中新增的任何代码也应包含单元测试和端到端测试。更多详情请参见 "
"[测试指南](../contribution/testing.md)。"
#: ../../developer_guide/feature_guide/patch.md:80
#: ../../developer_guide/Design_Documents/patch.md:80
msgid "Limitation"
msgstr "限制"
#: ../../developer_guide/feature_guide/patch.md:81
#: ../../developer_guide/Design_Documents/patch.md:81
msgid ""
"In V1 Engine, vLLM starts three kinds of process: Main process, EngineCore "
"process and Worker process. Now vLLM Ascend only support patch the code in "
@@ -232,7 +232,7 @@ msgstr ""
"进程整体打补丁,入口代码在 `vllm.v1.engine.core`。请完全重写 `EngineCoreProc` 和 "
"`DPEngineCoreProc`。"
#: ../../developer_guide/feature_guide/patch.md:82
#: ../../developer_guide/Design_Documents/patch.md:82
msgid ""
"If you are running an edited vLLM code, the version of the vLLM may be "
"changed automatically. For example, if you runs an edited vLLM based on "

View File

@@ -8,7 +8,7 @@ Context parallel feature currently is only supported on Atlas A3 device, and wil
vLLM-Ascend now supports long sequences with context parallel options. This guide provides step-by-step instructions to verify these features with constrained resources.
Take the DeepSeek-V3.1-w8a8 model as an example: use 3 Atlas 800T A3 servers to deploy the “1P1D” architecture. Node p is deployed across multiple machines, while node d is deployed on a single machine. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder server is 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In the current example, we will enable the context parallel feature on node p to improve TTFT. Although enabling the DCP feature on node d can reduce memory usage, it would introduce additional communication and small-operator overhead. Therefore, we will not enable the DCP feature on node d.
## Environment Preparation
@@ -69,223 +69,225 @@ We can run the following scripts to launch a server on the prefiller/decoder nod
1. Run the following scripts to launch online 128K inference on the three nodes respectively.
:::::{tab-set}
:sync-group: nodes
::::{tab-item} Prefiller node 1
:sync: prefill node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.1"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 0 \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
::::{tab-item} Prefiller node 2
:sync: prefill node2
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.2"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 1 \
--headless \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
::::{tab-item} Decoder node 1
:sync: decoder node1
```shell
nic_name="eth0" # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1
vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5980 \
--decode-context-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--no-enable-prefix-caching \
--distributed-executor-backend mp \
--served-model-name deepseek_v3 \
--seed 1024 \
--max-model-len 136000 \
--max-num-batched-tokens 128 \
--enable-chunked-prefill \
--max-num-seqs 4 \
--trust-remote-code \
--gpu-memory-utilization 0.96 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "3",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```
::::
:::::
2. Prefill master node `proxy.sh` script
```shell
python load_balance_proxy_server_example.py \
--port 8005 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
--prefiller-ports \
8004 \
--decoder-hosts \
192.0.0.3 \
--decoder-ports \
8004
```
3. Run the proxy
@@ -347,7 +349,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `DeepSeek-V3.1-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
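As an illustration, a random-traffic run of the `serve` subcommand against the proxy above might look like the following sketch (the dataset choice, lengths, and prompt count are assumptions; adjust the weight path to your own):

```shell
vllm bench serve \
  --host 192.0.0.1 \
  --port 8005 \
  --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
  --served-model-name deepseek_v3 \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 16
```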

View File

@@ -157,7 +157,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -22,55 +22,55 @@ Execute the following commands on each node in sequence. The results must all be
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
# Get virtual NPU IP.
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
4. Get the superpod ID and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
5. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with virtual NPU IP address).
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
6. Check NPU TLS Configuration
```bash
# The TLS settings should be consistent across all nodes
for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
```
::::
@@ -78,48 +78,48 @@ for i in {0..15}; do hccn_tool -i $i -tls -g ; done | grep switch
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
5. Check NPU TLS Configuration
```bash
# The TLS settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
::::

View File

@@ -12,48 +12,48 @@ Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1
1. Single Node Verification:
Execute the following commands in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
```
2. Check NPU HCCN Configuration:
Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.
```bash
cat /etc/hccn.conf
```
3. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address).
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
5. Check NPU TLS Configuration
```bash
# The TLS settings should be consistent across all nodes
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
```
## Run with Docker

View File

@@ -86,7 +86,8 @@ docker run --rm \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
# IMPORTANT: This must be a shared directory accessible by all nodes
-v /path/to/shared/cache:/root/.cache \
-it $IMAGE bash
```

View File

@@ -12,8 +12,8 @@ This document provides step-by-step guidance on how to deploy and benchmark the
| Common Sense Reasoning | ARC |
| Mathematical Reasoning | gsm8k |
| Natural Language Understanding | SuperGLUE_BoolQ |
| Comprehensive Examination | AGIEval |
| Multi-turn Dialogue | ShareGPT |
The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under an SLO of TPOT < 50 ms across different datasets and concurrency levels. Validation demonstrates that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.
@@ -171,7 +171,7 @@ Below is the raw detailed test results:
| 1 | 207 | 314 | 100 | 54.1 | 18.4 | 36.1 | 26.8 | 33.4% | 49.8% | 45.6% |
| 16 | 207 | 314 | 100 | 60.0 | 229.7 | 43.5 | 303.9 | 33.4% | 38.0% | 32.3% |
| 32 | 207 | 314 | 100 | 62.7 | 396.4 | 47.8 | 507.5 | 33.4% | 31.3% | 28.0% |
| **AGIEval** | | | | | | | | | | |
| 1 | 735 | 1880 | 100 | 53.1 | 18.7 | 31.8 | 34.1 | 50.3% | 66.8% | 81.9% |
| 24 | 735 | 1880 | 100 | 64.0 | 381.2 | 43.3 | 629.0 | 50.3% | 47.8% | 65.0% |
| 34 | 735 | 1880 | 100 | 70.0 | 494.6 | 50.2 | 768.4 | 50.3% | 39.4% | 55.3% |

View File

@@ -2,7 +2,7 @@
```{note}
1. This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance improvement.
2. Currently, the Atlas 300I series only supports eager mode and the float16 data type.
```
## Run vLLM on Atlas 300I Series
@@ -180,7 +180,6 @@ Run the following script (`example.py`) to execute offline inference on NPU:
```{code-block} python
:substitutions:
import gc
import torch
from vllm import LLM, SamplingParams
@@ -204,7 +203,7 @@ llm = LLM(
tensor_parallel_size=1,
max_model_len=4096,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
dtype="float16", # IMPORTANT: Some ATB ops do not support bf16 on the 300I series.
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
@@ -247,7 +246,7 @@ llm = LLM(
tensor_parallel_size=2,
max_model_len=4096,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
dtype="float16", # IMPORTANT: Some ATB ops do not support bf16 on the 300I series.
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
@@ -290,7 +289,7 @@ llm = LLM(
tensor_parallel_size=1,
max_model_len=4096,
enforce_eager=True, # For 300I series, only eager mode is supported.
dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 300I series
dtype="float16", # IMPORTANT: Some ATB ops do not support bf16 on the 300I series.
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)

View File

@@ -122,7 +122,7 @@ The parameters are explained as follows:
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT in the v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in Prefill-Decode (PD) disaggregation scenarios.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
- `--max-model-len` specifies the maximum context length, that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and an output length of 1.5K, a value of `16384` is sufficient; however, for precision testing, set it to at least `35000`.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- If you use the w4a8 weights, more memory will be available for the KV cache, and you can try increasing the load to achieve greater throughput.
@@ -181,7 +181,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
```shell
#!/bin/sh
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
@@ -258,10 +258,10 @@ Here are two accuracy evaluation methods.
2. After execution, you can get the result. Here is the result of `DeepSeek-R1-W8A8` on `vllm-ascend:0.11.0rc2`, for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024dataset | - | accuracy | gen | 80.00 |
| gpqadataset | - | accuracy | gen | 72.22 |
### Using Language Model Evaluation Harness
@@ -271,13 +271,13 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=path/DeepSeek-R1-W8A8,base_url=http://<node0_ip>:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the result.
@@ -291,7 +291,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `DeepSeek-R1-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:

View File

@@ -27,7 +27,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
- `DeepSeek-V3.1` (BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8-mtp-QuaRot` (quantized version with mix MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot` (quantized version with mix MTP): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Quantization method`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use this method to quantize the model.
It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
@@ -264,391 +264,391 @@ To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to depl
2. Prefill Node 0 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.1"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--additional-config '{"recompute_scheduler_enable":true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
3. Prefill Node 1 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.2"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--additional-config '{"recompute_scheduler_enable":true}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
4. Decode Node 0 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.3"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--quantization ascend \
--no-enable-prefix-caching \
--async-scheduling \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
5. Decode Node 1 `run_dp_template.sh` script
```shell
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="141.xx.xx.4"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
# [Optional] jemalloc
# jemalloc can improve performance; if `libjemalloc.so` is installed on your machine, you can turn it on.
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=1100
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_v3 \
--max-model-len 65536 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--quantization ascend \
--no-enable-prefix-caching \
--async-scheduling \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 8, 16, 24, 32, 48, 56]}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```
**Notice:**
The parameters are explained as follows:
- `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`: enables the communication optimization function on the prefill nodes.
- `VLLM_ASCEND_ENABLE_MLAPO=1`: enables the fusion operator, which can significantly improve performance but consumes more NPU memory. In the Prefill-Decode (PD) separation scenario, enable MLAPO only on decode nodes.
- `--async-scheduling`: enables the asynchronous scheduling function. When Multi-Token Prediction (MTP) is enabled, asynchronous scheduling of operator delivery can be implemented to overlap the operator delivery latency.
- `cudagraph_capture_sizes`: The recommended values are `n x (mtp + 1)`, with `n` ranging from 1 to `max-num-seqs`. For example, with `num_speculative_tokens = 1` (so `mtp + 1 = 2`) and `--max-num-seqs 28`, the sizes run from 2 up to 56, as in the `[2, 4, 8, 16, 24, 32, 48, 56]` list above. Beyond this rule, it is recommended to use the batch sizes that occur most frequently on the Decode (D) node.
- `recompute_scheduler_enable: true`: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.
- `multistream_overlap_shared_expert: true`: When the Tensor Parallelism (TP) size is 1 or `enable_shared_expert_dp: true`, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
- `lmhead_tensor_parallel_size: 16`: When the Tensor Parallelism (TP) size of the decode node is 1, this parameter allows the TP size of the LMHead embedding layer to be greater than 1, which is used to reduce the computational load of each card on the LMHead embedding layer.
6. Run the server on each node:
```shell
# p0
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100
# p1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 141.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100
# d0
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
# d1
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
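Before bringing up the proxy, you can optionally confirm that each instance is reachable. The sketch below assumes vLLM's standard `/health` liveness endpoint and the `--vllm-start-port 7100` layout used above:

```shell
# Expect HTTP 200 from every healthy instance.
curl -s -o /dev/null -w "%{http_code}\n" http://141.xx.xx.1:7100/health   # p0, first local dp rank
curl -s -o /dev/null -w "%{http_code}\n" http://141.xx.xx.3:7100/health   # d0, first local dp rank
```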
7. Run the `proxy.sh` script on the prefill master node
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
python load_balance_proxy_server_example.py \
--port 1999 \
--host 141.xx.xx.1 \
--prefiller-hosts \
141.xx.xx.1 \
141.xx.xx.1 \
141.xx.xx.2 \
141.xx.xx.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.3 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
141.xx.xx.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```
```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
## Functional Verification
@@ -704,7 +704,7 @@ The performance result is:
Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
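They are `latency`, `throughput`, and `serve`. A minimal online sketch against an already-running endpoint (the model path, host, and rate values are illustrative, reusing the flags shown later in this guide):
```shell
vllm bench serve \
    --model /root/.cache/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot \
    --host 127.0.0.1 \
    --port 8000 \
    --endpoint /v1/completions \
    --max-concurrency 1 \
    --request-rate 1
```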


@@ -806,31 +806,31 @@ Refer to [Distributed DP Server With Large-Scale Expert Parallelism](https://doc
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
### Request Forwarding
@@ -896,13 +896,13 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the result.
@@ -926,7 +926,7 @@ The performance result is:
Run performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -668,31 +668,31 @@ Once the preparation is done, you can start the server with the following comman
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p0_ip --dp-rpc-port 12880 --vllm-start-port 9300
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address $node_p1_ip --dp-rpc-port 12880 --vllm-start-port 9300
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address $node_d0_ip --dp-rpc-port 12778 --vllm-start-port 9300
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address $node_d0_ip --dp-rpc-port 12778 --vllm-start-port 9300
```
### Request Forwarding
@@ -722,7 +722,7 @@ python load_balance_proxy_server_example.py \
$node_d1_ip \
--decoder-ports \
9300 9301 9302 9303 \
9300 9301 9302 9303
```
## Functional Verification
@@ -763,7 +763,7 @@ Here are two accuracy evaluation methods.
### Using Language Model Evaluation Harness
Not test yet.
Not tested yet.
## Performance
@@ -775,7 +775,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `GLM-4.x` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -802,7 +802,7 @@ vllm bench serve \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 1 \
--request-rate 1 \
--request-rate 1
```
After several minutes, you can get the performance evaluation result.


@@ -1351,7 +1351,7 @@ Here are two accuracy evaluation methods.
### Using Language Model Evaluation Harness
Not test yet.
Not tested yet.
## Performance


@@ -103,7 +103,7 @@ Your model files look like:
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
|-- tokenizer_config.json
```
## Online Inference on Multi-NPU


@@ -215,46 +215,46 @@ The 910B4 device supports inference using the PaddlePaddle framework.
1. Pull the PaddlePaddle-compatible CANN image
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-aarch64-gcc84
```
Start the container using the following command:
```bash
docker run -it --name paddle-npu-dev -v $(pwd):/work \
--privileged --network=host --shm-size=128G -w=/work \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/dcmi:/usr/local/dcmi \
-e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann800-ubuntu20-npu-910b-base-$(uname -m)-gcc84 /bin/bash
```
2. Install [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=undefined) and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
```bash
python -m pip install paddlepaddle==3.2.0
wget https://paddle-whl.bj.bcebos.com/stable/npu/paddle-custom-npu/paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
pip install paddle_custom_npu-3.2.0-cp310-cp310-linux_aarch64.whl
python -m pip install -U "paddleocr[doc-parser]"
pip install safetensors
```
:::{note}
The OpenCV component may be missing:
```bash
apt-get update
apt-get install -y libgl1 libglib2.0-0
```
CANN-8.0.0 does not support some versions of NumPy and OpenCV. It is recommended to install the specified versions.
```bash
python -m pip install numpy==1.26.4
python -m pip install opencv-python==3.4.18.65
```
:::
::::
::::{tab-item} OM inference


@@ -328,7 +328,7 @@ vllm serve Qwen/Qwen3-VL-8B-Instruct \
```
:::{note}
Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the HBM size. Please modify the value according to a suitable value for your NPU series.
Add the `--max_model_len` option to avoid a ValueError because the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in the KV cache. This differs across NPU series based on the HBM size. Please set it to a value suitable for your NPU series.
:::
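For example, a launch capping the context length might look like this (a sketch; 32768 is an illustrative value, choose one that fits your NPU's HBM):
```shell
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --max-model-len 32768
```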
If your service starts successfully, you can see the info shown below:
@@ -474,8 +474,6 @@ The accuracy of some models is already within our CI monitoring scope, including
- `Qwen2.5-VL-7B-Instruct`
- `Qwen3-VL-8B-Instruct`
You can refer to the [monitoring configuration](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/vllm_ascend_test_nightly_a2.yaml).
:::::{tab-set}
:sync-group: install
@@ -486,28 +484,28 @@ As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy e
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
```shell
pip install lm_eval
```
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
--tasks mmmu_val \
--batch_size 32 \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
```
3. After execution, you can get the result; here is the result of `Qwen3-VL-8B-Instruct` in `vllm-ascend:0.11.0rc3`, for reference only.
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val | 0|none | |acc |↑ |0.5389|± |0.0159|
::::
::::{tab-item} Qwen2.5-VL-32B-Instruct
@@ -517,27 +515,27 @@ As an example, take the `mmmu_val` dataset as a test dataset, and run accuracy e
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for more details on `lm_eval` installation.
```shell
pip install lm_eval
```
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen2.5-VL-32B-Instruct,max_model_len=8192,tensor_parallel_size=2 \
--tasks mmmu_val \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
```
3. After execution, you can get the result; here is the result of `Qwen2.5-VL-32B-Instruct` in `vllm-ascend:0.11.0rc3`, for reference only.
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val | 0|none | |acc |↑ |0.5744|± |0.0158|
::::
:::::
@@ -546,7 +544,7 @@ lm_eval \
### Using vLLM Benchmark
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -99,7 +99,7 @@ Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 pla
```shell
#!/bin/sh
export ASCEND_RT_VISIBLE_DEVICES=0
export MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
export MODEL_PATH="./Qwen2.5-7B-Instruct"
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
@@ -156,7 +156,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -177,4 +177,4 @@ vllm bench serve \
--result-dir ./perf_results/
```
After about several minutes, you can get the performance evaluation result.
After several minutes, you can get the performance evaluation result.


@@ -69,7 +69,7 @@ docker run --rm \
#### Single NPU (Qwen2.5-Omni-7B)
:::{note}
The env `LOCAL_MEDIA_PATH` which allowing API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.
The **environment variable** `LOCAL_MEDIA_PATH` **allows** API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk and should only be enabled in trusted environments.
:::
@@ -99,7 +99,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
`--allowed-local-media-path` is optional; only set it if you need to run inference with local media files.
`--gpu-memory-utilization` should not be set manually only if you know what this parameter aims to.
`--gpu-memory-utilization` should not be set manually unless you know what this parameter does.
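A sketch combining these options (the media directory is a hypothetical path; `--gpu-memory-utilization` is deliberately left at its default):
```bash
export LOCAL_MEDIA_PATH=/data/local_media
vllm serve Qwen/Qwen2.5-Omni-7B \
    --allowed-local-media-path $LOCAL_MEDIA_PATH
```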
#### Multiple NPU (Qwen2.5-Omni-7B)
@@ -128,7 +128,7 @@ Not supported yet.
## Functional Verification
If your service start successfully, you can see the info shown below:
If your service **starts** successfully, you can see the info shown below:
```bash
INFO: Started server process [2736]
@@ -195,7 +195,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen2.5-Omni-7B` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G × 8)nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G × 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
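For example, with the ModelScope CLI (a sketch; assumes `pip install modelscope`, and the target directory is illustrative):
```shell
modelscope download --model Qwen/Qwen3-235B-A22B --local_dir /root/.cache/Qwen3-235B-A22B
```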
@@ -191,7 +191,7 @@ vllm serve Qwen/Qwen3-235B-A22B \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9
```
Node1
@@ -298,7 +298,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
@@ -440,7 +440,6 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
--enforce-eager \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",


@@ -8,7 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.
The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
The Qwen3 Dense models are first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
## Supported Features
@@ -288,7 +288,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-32B-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -152,7 +152,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-Next` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -16,7 +16,7 @@ Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user
### Model Weight
- `Qwen3-Omni-30B-A3B-Thinking` require 2 NPU Card(64G × 2).[Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)
- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64G × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
### Installation
@@ -239,32 +239,32 @@ As an example, take the `gsm8k` `omnibench` `bbh` dataset as a test dataset, and
1. Refer to [Using evalscope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.
```bash
evalscope eval \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
--api-url http://localhost:8000/v1 \
--api-key EMPTY \
--eval-type server \
--datasets omni_bench, gsm8k, bbh \
--dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
--eval-batch-size 1 \
--generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
--limit 100
```
3. After execution, you can get the result; here is the result of `Qwen3-Omni-30B-A3B-Thinking` in `vllm-ascend:0.13.0rc1`, for reference only.
```bash
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default | 100 | 0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k | mean_acc | main | 100 | 0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh | mean_acc | OVERALL | 270 | 0.9148 | |
+-----------------------------+------------+----------+----------+-------+---------+---------+
```
## Performance
@@ -272,7 +272,7 @@ evalscope eval \
Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3-VL-235B-A22B-Instruct`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node2 Atlas 800 A264G * 8nodes. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-235B-A22B-Instruct/)
- `Qwen3-VL-235B-A22B-Instruct`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node2 Atlas 800 A264G × 8nodes. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-VL-235B-A22B-Instruct/)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -140,7 +140,7 @@ Node1
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# this obtained through ifconfig
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"
@@ -258,7 +258,7 @@ Refer to [Using AISBench for performance evaluation](../../developer_guide/evalu
Run performance evaluation of `Qwen3-VL-235B-A22B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:


@@ -204,4 +204,4 @@ INFO 12-24 09:49:22 [loggers.py:257] Engine 000: Avg prompt throughput: 19.6 tok
### Offline Inference
The usage of offline inference with `Qwen3-VL-30B-A3B-Instruct` is totally the same as that of `Qwen3-VL-8B-Instruct`, find more details at [link](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen-VL-Dense.html#offline-inference).
The usage of offline inference with `Qwen3-VL-30B-A3B-Instruct` is the same as that of `Qwen3-VL-8B-Instruct`; find more details at [Qwen3-VL-8B-Instruct](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/models/Qwen-VL-Dense.html#offline-inference).


@@ -212,7 +212,7 @@ Processed prompts: 100%|██████████████████
For more examples, refer to the vLLM official examples:
- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_rerank_api_online.py)
## Performance


@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
### Model Weight
- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)
- `Qwen3.5-27B`(BF16 version): requires 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
- `Qwen3.5-27B-w8a8`(Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)
It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -87,7 +87,7 @@ If you want to deploy multi-node environment, you need to set up environment on
### Single-node Deployment
`Qwen3.5-27B` and `Qwen3.5-27B-w8a8` can both be deployed on 1 Atlas 800 A3(64G16), 1 Atlas 800 A2(64G8). Quantized version need to start with parameter --quantization ascend.
`Qwen3.5-27B` and `Qwen3.5-27B-w8a8` can both be deployed on 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. The quantized version needs to start with the parameter `--quantization ascend`.
Run the following script to execute online 128k inference.
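A minimal sketch of the core command for the quantized weights (paths, TP size, and context length are illustrative; the full script follows in the original tutorial):
```shell
vllm serve /root/.cache/Qwen3.5-27B-w8a8 \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --quantization ascend
```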


@@ -13,7 +13,7 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
- **Reduced Per-Device Memory Footprint**:
Fine-grained TP shards large weight matrices( LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware, without quantization.
- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
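On vLLM Ascend, such sharding can be switched on via additional-config keys, as in this sketch (`lmhead_tensor_parallel_size` and `oproj_tensor_parallel_size` are assumptions here; check your version's additional-config reference):
```shell
vllm serve /path/to/model \
    --tensor-parallel-size 8 \
    --additional-config '{"lmhead_tensor_parallel_size": 8, "oproj_tensor_parallel_size": 8}'
```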


@@ -18,9 +18,8 @@ Batch invariance is crucial for several use cases:
## Hardware Requirements
Batch invariance currently requires Ascend NPUs for 910B,
because only 910B supports batch invariance with HCCL communication for now,
we will support other NPUs in the future.
Batch invariance currently requires Ascend 910B NPUs, because only the 910B supports batch invariance with HCCL communication for now.
We will support other NPUs in the future.
## Software Requirements


@@ -11,7 +11,7 @@ Context parallel mainly solves the problem of serving long context requests. As
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/Design_Documents/context_parallel.md).
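As a sketch, decode context parallel can be combined with tensor parallel like this (the `--decode-context-parallel-size` flag is an assumption based on upstream vLLM; see the developer guide above for the authoritative flags on Ascend):
```shell
vllm serve /path/to/model \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 2
```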
## Supported Scenarios


@@ -11,15 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
### Prerequisites
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a `.csv` file, which should be first downloaded from [A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv` (see the sketch after these prerequisites).
2. `pandas` is needed to load the lookup table. Install it if it is not already available:
```bash
pip install pandas
```
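The download-and-rename step from the first prerequisite might look like this (a sketch; it assumes the command is run from the repository root so the relative target path resolves):
```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv \
    -O vllm_ascend/core/profile_table.csv
```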
### Tuning Parameters
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
```python


@@ -6,30 +6,28 @@ A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in
1. **Independent, fine-grained scaling**
* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
2. **Lower time-to-first-token (TTFT)**
* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
3. **Cross-process reuse and caching of encoder outputs**
* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---
## Usage
The current reference pathway is **ExampleConnector**.
Below ready-to-run scripts shows the workflow:
The ready-to-run scripts below show the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
@@ -45,7 +43,7 @@ Below ready-to-run scripts shows the workflow:
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** a vLLM instance to performs vision encoding.
* **Encoder instance** a vLLM instance to perform vision encoding.
* **Prefill/Decode (PD) instance(s)** runs language pre-fill and decode.
* PD can be either a single normal instance (E + PD) or disaggregated instances (E + P + D)
@@ -62,11 +60,11 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and referred to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using MooncakeLayerwiseConnector: the request first enters the Decoder instance, and the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes the KV cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
`docs/source/developer_guide/feature_guide/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
## Limitations


@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
3. Model Compatibility:
- Only MoE models with explicit expert parallelism support (e.g., Qwen3 MoE models) are compatible.
- Verify model architecture supports dynamic expert routing through --enable-expert-parallel.
- Verify that the model architecture supports dynamic expert routing through `--enable-expert-parallel`.
4. Monitoring & Validation:
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.


@@ -65,7 +65,7 @@ Online example:
vllm serve path/to/Qwen3-32B --tensor-parallel-size 8 --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'
```
You can find more details about Xlite [here](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
You can find more details about [Xlite](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md)
## Fallback to the Eager Mode


@@ -19,41 +19,41 @@ Taking the DeepSeek model as an example, using 8 Atlas 800T A3 servers to deploy
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
3. Get superpod ID and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
::::
@@ -61,35 +61,35 @@ for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
3. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
::::
@@ -426,59 +426,59 @@ In the PD separation scenario, we provide an optimized configuration.
2. Add the `--enforce-eager` flag to `vllm serve`.
3. Set `--kv-transfer-config` as follows:
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
4. Set `--additional-config` as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```
- **decoder node**
1. Set HCCL_BUFFSIZE=1024.
2. Set `--kv-transfer-config` as follows:
```shell
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0"
}'
```
3. Set `--additional-config` as follows:
```shell
--additional-config '{"enable_weight_nz_layout":true}'
```
### Parameters Description
1. `--additional-config` parameter introduction:
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
- **"enable_weight_nz_layout"**: Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations"**: Whether to enable DeepSeek models' prefill optimizations.
<br>
2. Enable MTP
Add the following option to your configuration.
```shell
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
```
### Recommended Configuration Example


@@ -56,16 +56,16 @@ Assuming your working directory is ```/workspace``` and vllm/vllm-ascend have al
1. Install LMCache Repo
```bash
NO_CUDA_EXT=1 pip install lmcache==0.3.12
```
2. Install LMCache-Ascend Repo
```bash
cd /workspace/LMCache-Ascend
python3 -m pip install -v --no-build-isolation -e .
```
### Usage


@@ -32,4 +32,4 @@ vllm serve Qwen/Qwen2-7B-Instruct
--additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":true, "enable_static_kernel":false}}'
```
You can find more details about npugraph_ex [here](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)
You can find more details about [npugraph_ex](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00021.html)


@@ -14,11 +14,11 @@ Currently, vllm-ascend has implemented Sequence Parallelism for VL-class models
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
--tensor-parallel-size 2 \
--compilation-config '{"pass_config": {"enable_sp": true, , "sp_min_token_num": 1000}}'
--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 1000}}'
```
- `"enable_sp"`: This is the switch for SP. Since SP relies on graph mode, it is not supported in eager mode.
- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative benefits. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
- `sp_min_token_num` (from upstream vllm's `pass_config`): Based on our experiments, when the number of tokens is small (empirical value is less than 1000), SP can actually bring negative impact. This is because when the communication volume is small, the fixed overhead of the communication operator becomes the dominant factor. SP will only take effect when `num_tokens >= sp_min_token_num`. **The default value is 1000 on Ascend, which generally does not need to be modified.** To customize, use `--compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'`. The value will be appended into `compile_ranges_split_points`, which splits the graph compilation range and checks whether the pass is applicable per range.
Without modifying `sp_min_token_num`, the simplest and recommended way to enable SP is:
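The following sketch reconstructs it from the example above (model and TP size are illustrative; `sp_min_token_num` is left at its default):
```bash
vllm serve Qwen/Qwen3-VL-2B-Instruct \
    --tensor-parallel-size 2 \
    --compilation-config '{"pass_config": {"enable_sp": true}}'
```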


@@ -91,7 +91,7 @@ A few important things to consider when using the EAGLE based draft models:
## Speculating using MTP speculators
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html)
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/Multi_Token_Prediction.html)
- Online inference
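A minimal online sketch (the model path is a placeholder; the config format mirrors the MTP example used elsewhere in these docs):
```shell
vllm serve /path/to/DeepSeek-V3 \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}'
```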


@@ -2,7 +2,7 @@
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as quantize, MoE gating top_k, RMSNorm and SwiGlu. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
Since we use vector computations to hide the weight prefetching pipeline, it has effect on computation, if you prioritize low latency over high throughput, then it is best not to enable prefetching.
Since we use vector computations to hide the weight prefetching pipeline, this has an effect on computation. If you prioritize low latency over high throughput, it is best not to enable prefetching.
## Quick Start
@@ -35,39 +35,39 @@ Notes:
1) For MoE model:
```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "attn": {
        "qkv": 1.0,
        "o": 1.0
      },
      "moe": {
        "gate_up": 0.8
      }
    }
  }
}'
```
2) For dense model:
Following is the default configuration that achieves good performance with `--max-num-seqs` set to 72 for Qwen3-32B-W8A8:
```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "mlp": {
        "gate_up": 1.0,
        "down": 1.0
      }
    }
  }
}'
```


@@ -49,9 +49,9 @@ This is the first release candidate of v0.17.0 for vLLM Ascend. Please follow th
### Known Issue
- GLM5 requires transformers==5.2.0, and this will resolved by [vllm-project/vllm#30566](https://github.com/vllm-project/vllm/pull/30566), will not included in v0.17.0.
- GLM5 requires transformers==5.2.0; this will be resolved by [vllm-project/vllm#30566](https://github.com/vllm-project/vllm/pull/30566), which will not be included in v0.17.0.
- There is a precision issue with Qwen3-Next due to the changed tp weight split method. This will be fixed in the next release.
- The minimum number of tokens of prefix cache hit in hybrid model is large now. The exact number is related to tp size, e.g., with tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached.
- In hybrid models, the minimum token count required for a prefix cache hit is currently large. The exact number is related to tp size, e.g., with tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached.
- GLM5 has an issue in the 2-node PD mixed deployment scenario where inference may hang when concurrency exceeds 8 (fixed in PR [#7235](https://github.com/vllm-project/vllm-ascend/pull/7235) [#7290](https://github.com/vllm-project/vllm-ascend/pull/7290)).
## v0.16.0rc1 - 2026.03.09
@@ -96,7 +96,7 @@ This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow th
### Deprecation & Breaking Changes
- `enable_flash_comm_v1` config option has been renamed back to `enable_sp`. [#6883](https://github.com/vllm-project/vllm-ascend/pull/6883)
- The auto-detect quantization format from model files is reverted, in v0.16.0rc1, we still need to add `--quantization ascend` to serve a model quantinized by modelslim. It will be added back in the next version after the bug with the remote model id is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
- The auto-detect quantization format from model files is reverted, in v0.16.0rc1, we still need to add `--quantization ascend` to serve a model quantized by modelslim. It will be added back in the next version after the bug with the remote model id is fixed. [#6873](https://github.com/vllm-project/vllm-ascend/pull/6873)
### Documentation
@@ -448,7 +448,7 @@ We mainly focus on quality and performance improvement in this release. The spec
- `CPUOffloadingConnector` is deprecated. We'll remove it in the next release. It'll be replaced by CPUOffload feature from vLLM in the future.
- eplb config options are moved to `eplb_config` in [additional config](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/configuration/additional_config.html). The old ones will be removed in the next release.
- `ProfileExecuteDuration` [feature](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/profile_execute_duration.html) is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- `ProfileExecuteDuration` [feature](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc2/docs/source/developer_guide/performance_and_debug/profile_execute_duration.md) is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- The value of the `VLLM_ASCEND_ENABLE_MLAPO` env will be set to True by default in the next release. It'll be enabled on decode nodes by default. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False.
## v0.13.0rc1 - 2025.12.27
@@ -457,8 +457,8 @@ This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots o
### Highlights
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen3-Next.html)
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc1/docs/source/tutorials/DeepSeek-V3.2.md)
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://github.com/vllm-project/vllm-ascend/blob/v0.13.0rc1/docs/source/tutorials/Qwen3-Next.md)
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported, but notice that it is an experimental feature now, welcome any feedback. please refer to [context parallel feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/context_parallel.html)
### Features
@@ -552,7 +552,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Highlights
- DeepSeek 3.2 is stable and performance is improved. In this release, you don't need to install any other packages now. Following the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html) to start using it.
- DeepSeek 3.2 is stable and performance is improved. In this release, you don't need to install any other packages. Follow the [official tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.12.0rc1/docs/source/tutorials/DeepSeek-V3.2.md) to start using it.
- Async scheduler is more stable and ready to enable now. Please set `--async-scheduling` to enable it.
- More new models, such as Qwen3-omni, DeepSeek OCR, PaddleOCR, OpenCUA are supported now.
@@ -572,7 +572,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
### Other
- [Experimental] Mooncake layerwise connector is supported now.
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] [KV cache pool](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/Design_Documents/KV_Cache_Pool_Guide.html) feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs well with some models. Follow the [official tutorial](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html#using-xlitegraph) to start using it.
- LLMdatadist kv connector is removed. Please use mooncake connector instead.
- Ascend scheduler is removed. `--additional-config {"ascend_scheduler": {"enabled": true}` doesn't work anymore.
@@ -580,7 +580,7 @@ This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots o
- `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough. We enable it by default now.
- speculative decode method `Ngram` is back now.
- msprobe tool is added to help user to check the model accuracy. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/msprobe_guide.html) to get started.
- msserviceprofiler tool is added to help user to profile the model performance. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/performance_and_debug/service_profiling_guide.html) to get started.
- msserviceprofiler tool is added to help user to profile the model performance. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.12.0rc1/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md) to get started.
### Upgrade Note
@@ -658,12 +658,12 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
- CANN is upgraded to 8.3.RC1. Torch-npu is upgraded to 2.7.1. [#3945](https://github.com/vllm-project/vllm-ascend/pull/3945) [#3896](https://github.com/vllm-project/vllm-ascend/pull/3896)
- PrefixCache and Chunked Prefill are enabled by default. [#3967](https://github.com/vllm-project/vllm-ascend/pull/3967)
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) Official tutorial is available at [here](<https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu_qwen3_w4a4.html>.
- W4A4 quantization is supported now. [#3427](https://github.com/vllm-project/vllm-ascend/pull/3427) Official tutorial is available at [single_npu_qwen3_w4a4](https://github.com/vllm-project/vllm-ascend/pull/4076/changes).
### Core
- Performance of Qwen3 and DeepSeek V3 series models is improved.
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html).
- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find the tutorial at [pd_disaggregation_mooncake_multi_node](https://github.com/vllm-project/vllm-ascend/blob/v0.11.0rc1/docs/source/tutorials/multi_node_pd_disaggregation_mooncake.md).
- MTP > 1 is supported now. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708) A hedged sketch follows this list.
- [Experimental] Graph mode `FULL_DECODE_ONLY` is supported now! And `FULL` will be landing in the next few weeks. [#2128](https://github.com/vllm-project/vllm-ascend/pull/2128)
- Pooling models, such as bge-m3, are supported now. [#3171](https://github.com/vllm-project/vllm-ascend/pull/3171)
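To make the "MTP > 1" item above concrete, here is a hedged sketch of a speculative decoding configuration that asks for more than one MTP token per step. The `deepseek_mtp` method name, the model path, and the parallel size are assumptions for illustration, not a verified recipe.

```python
# Hedged sketch: DeepSeek with multi-token prediction (MTP) drafting and
# num_speculative_tokens > 1. Assumes the checkpoint ships MTP weights and
# that the method is named "deepseek_mtp"; all values are illustrative.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder; requires MTP weights
    tensor_parallel_size=16,           # illustrative parallel size
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 2,   # the "MTP > 1" part
    },
)
```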
@@ -687,7 +687,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release
## v0.11.0rc0 - 2025.09.30
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.11.0rc0) to get started.
### Highlights
@@ -711,11 +711,11 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
## v0.10.2rc1 - 2025.09.16
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.2rc1) to get started.
### Highlights
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features do not work with this release. We will be adding support for them soon. Follow the [official guide](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu_qwen3_next.html) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features do not work with this release. We will be adding support for them soon. Follow the [official guide](https://github.com/vllm-project/vllm-ascend/blob/v0.10.2rc1/docs/source/tutorials/multi_npu_qwen3_next.md) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added quantization support for aclgraph [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
### Core
@@ -754,7 +754,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
## v0.10.1rc1 - 2025.09.04
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.1rc1) to get started.
### Highlights
@@ -808,15 +808,15 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
We are excited to announce the newest official release of vLLM Ascend. This release includes many new features, performance improvements, and bug fixes. We recommend users upgrade from 0.7.3 to this version. Please always set `VLLM_USE_V1=1` to use the V1 engine.
In this release, we added many enhancements for large scale expert parallel case. It's recommended to follow the [official guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html).
In this release, we added many enhancements for the large-scale expert parallel case. It's recommended to follow the [official guide](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md).
Please note that these release notes list all the important changes since the last official release (v0.7.3).
### Highlights
- DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to [muliti node tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/multi_node.html) and [Large Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html).
- DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to [multi node tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/multi_node.html) and [Large Scale Expert Parallelism](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md).
- Qwen series models work with graph mode now. It works by default with V1 Engine. Please refer to [Qwen tutorials](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/index.html).
- Disaggregated Prefilling support for V1 Engine. Please refer to [Large Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/v0.9.1/tutorials/large_scale_ep.html) tutorials.
- Disaggregated Prefilling support for V1 Engine. Please refer to [Large Scale Expert Parallelism](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1/docs/source/tutorials/large_scale_ep.md) tutorials.
- Automatic prefix caching and chunked prefill features are supported.
- The speculative decoding feature works with the Ngram and MTP methods; a hedged sketch follows this list.
- W4A8 quantization is now supported for MoE and dense models. Please refer to the [quantization guide](https://docs.vllm.ai/projects/ascend/en/v0.9.1/user_guide/feature_guide/quantization.html).
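As a concrete illustration of the Ngram method listed above, here is a hedged sketch using vLLM's prompt-lookup (n-gram) drafting; the model name and the exact numbers are placeholders, not tuned values.

```python
# Hedged sketch: n-gram (prompt lookup) speculative decoding. Draft tokens
# are matched from the prompt itself, so no separate draft model is needed.
# Model name and parameter values are illustrative placeholders.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # draft length per step
        "prompt_lookup_max": 4,        # longest n-gram matched in the prompt
    },
)
```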
@@ -884,18 +884,18 @@ This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.10.0rc1 - 2025.08.07
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. V0 is completely removed from this version.
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.10.0rc1) to get started. V0 is completely removed from this version.
### Highlights
- Disaggregate prefill works with V1 engine now. You can take a try with DeepSeek model [#950](https://github.com/vllm-project/vllm-ascend/pull/950), following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md).
- Disaggregated prefill works with the V1 engine now. You can try it with the DeepSeek model [#950](https://github.com/vllm-project/vllm-ascend/pull/950), following this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/examples/disaggregated_prefill_v1/README.md).
- The W4A8 quantization method is supported for dense and MoE models now. [#2060](https://github.com/vllm-project/vllm-ascend/pull/2060) [#2172](https://github.com/vllm-project/vllm-ascend/pull/2172)
### Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to `2.7.1.dev20250724`. [#1562](https://github.com/vllm-project/vllm-ascend/pull/1562) And CANN has been upgraded to `8.2.RC1`. [#1653](https://github.com/vllm-project/vllm-ascend/pull/1653) Don't forget to update them in your environment, or use the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the A3 image will be released starting from this version. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
- Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 is supported in vLLM Ascend, please following this [tutorial](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node_kimi.md.html) to have a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Kimi-K2 with W8A8 quantization, Qwen3-Coder, and GLM-4.5 are supported in vLLM Ascend; please follow this [tutorial](https://github.com/vllm-project/vllm-ascend/blob/v0.10.0rc1/docs/source/tutorials/multi_node_kimi.md) to give them a try. [#2162](https://github.com/vllm-project/vllm-ascend/pull/2162)
- Pipeline Parallelism is supported in V1 now. [#1800](https://github.com/vllm-project/vllm-ascend/pull/1800)
- The prefix cache feature now works with the Ascend Scheduler. [#1446](https://github.com/vllm-project/vllm-ascend/pull/1446)
- Torchair graph mode works with tp > 4 now. [#1508](https://github.com/vllm-project/vllm-ascend/issues/1508)
@@ -1055,7 +1055,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
## v0.9.2rc1 - 2025.07.11
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started. From this release, V1 engine will be enabled by default, there is no need to set `VLLM_USE_V1=1` any more. And this release is the last version to support V0 engine, V0 code will be clean up in the future.
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.2rc1) to get started. From this release, the V1 engine is enabled by default, so there is no need to set `VLLM_USE_V1=1` anymore. This release is also the last version to support the V0 engine; V0 code will be cleaned up in the future.
### Highlights
@@ -1106,7 +1106,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
## v0.9.1rc1 - 2025.06.22
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to get started.
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.1rc1) to get started.
### Experimental
@@ -1167,7 +1167,7 @@ This release contains some quick fixes for v0.9.0rc1. Please use this release in
## v0.9.0rc1 - 2025.06.09
This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. From this release, V1 Engine is recommended to use. The code of V0 Engine is frozen and will not be maintained any more. Please set environment `VLLM_USE_V1=1` to enable V1 Engine.
This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.9.0rc1) to start the journey. From this release, the V1 Engine is recommended. The V0 Engine code is frozen and will no longer be maintained. Please set the environment variable `VLLM_USE_V1=1` to enable the V1 Engine.
### Highlights
@@ -1256,7 +1256,7 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
## v0.8.5rc1 - 2025.05.06
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Now you can enable V1 egnine by setting the environment variable `VLLM_USE_V1=1`, see the feature support status of vLLM Ascend in [here](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html).
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.5rc1) to start the journey. Now you can enable the V1 engine by setting the environment variable `VLLM_USE_V1=1`; see the feature support status of vLLM Ascend in [supported_features](https://github.com/vllm-project/vllm-ascend/blob/v0.8.5rc1/docs/source/user_guide/suppoted_features.md).
### Highlights
@@ -1282,12 +1282,12 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
## v0.8.4rc2 - 2025.04.29
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.4rc2) to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We'll make them stable enough in the next release.
### Highlights
- Qwen3 and Qwen3MOE is supported now. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/single_npu.html) to run the quick demo. [#709](https://github.com/vllm-project/vllm-ascend/pull/709)
- Ascend W8A8 quantization method is supported now. Please take the [official doc](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu_quantization.html) for example. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/619) is welcome. [#580](https://github.com/vllm-project/vllm-ascend/pull/580)
- Qwen3 and Qwen3MOE are supported now. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.8.4rc2/docs/source/tutorials/single_npu.md) to run the quick demo. [#709](https://github.com/vllm-project/vllm-ascend/pull/709)
- The Ascend W8A8 quantization method is supported now. Please take the [official doc](https://github.com/vllm-project/vllm-ascend/blob/v0.8.4rc2/docs/source/tutorials/multi_npu_quantization.md) as an example; a hedged sketch follows this list. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/619) is welcome. [#580](https://github.com/vllm-project/vllm-ascend/pull/580)
- DeepSeek V3/R1 works with DP, TP, and MTP now. Please note that it's still experimental. Let us know if you hit any problems. [#429](https://github.com/vllm-project/vllm-ascend/pull/429) [#585](https://github.com/vllm-project/vllm-ascend/pull/585) [#626](https://github.com/vllm-project/vllm-ascend/pull/626) [#636](https://github.com/vllm-project/vllm-ascend/pull/636) [#671](https://github.com/vllm-project/vllm-ascend/pull/671)
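For the W8A8 item above, a hedged sketch of loading an already-quantized checkpoint is shown below. It assumes weights that were quantized offline beforehand and that the Ascend quantization backend is selected with `quantization="ascend"`; the local path and parallel size are placeholders.

```python
# Hedged sketch: serving an Ascend W8A8-quantized checkpoint offline.
# Assumes pre-quantized weights on disk and the "ascend" quantization
# backend; the path and tensor_parallel_size are placeholders.
from vllm import LLM

llm = LLM(
    model="/path/to/Qwen2.5-7B-w8a8",  # pre-quantized weights (placeholder)
    quantization="ascend",             # select the Ascend quantization backend
    tensor_parallel_size=2,            # illustrative
)
```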
### Core
@@ -1305,18 +1305,18 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
## v0.8.4rc1 - 2025.04.18
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/) to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the detail from the [official documentation](https://docs.vllm.ai/projects/ascend/en/latest/community/versioning_policy.html#release-window).
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the [official doc](https://github.com/vllm-project/vllm-ascend/tree/v0.8.4rc1) to start the journey. From this version, vllm-ascend will follow the newest version of vLLM and release every two weeks. For example, if vLLM releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the details in the [official documentation](https://docs.vllm.ai/projects/ascend/en/latest/community/versioning_policy.html#release-window).
### Highlights
- vLLM V1 engine experimental support is included in this version. You can visit [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want to use V1 forcibly.
- LoRAMulti-LoRA And Dynamic Serving is supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/v0.8.4/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fall back to V0 if V1 doesn't work; set the `VLLM_USE_V1=1` environment variable if you want to force V1 (a minimal sketch follows this list).
- LoRA, Multi-LoRA, and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/v0.8.4/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- Sleep Mode feature is supported. Currently it only works on V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
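Since V1 was opt-in at this point, here is a minimal sketch of forcing it from Python; the variable must be set before vLLM is imported, and the model name is a placeholder.

```python
# Minimal sketch: force the V1 engine instead of falling back to V0.
# The environment variable must be set before importing vllm.
import os

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
```

The shell equivalent is simply prefixing the command, e.g. `VLLM_USE_V1=1 vllm serve ...`.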
### Core
- The Ascend scheduler is added for the V1 engine. This scheduler has better affinity with Ascend hardware. More scheduler policies will be added in the future. [#543](https://github.com/vllm-project/vllm-ascend/pull/543)
- Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it's ready from vLLM. Follow the [official guide](https://docs.vllm.ai/en/latest/features/disagg_prefill.html) to use. [#432](https://github.com/vllm-project/vllm-ascend/pull/432)
- Disaggregated Prefill feature is supported. Currently only 1P1D works; NPND is under design by the vLLM team, and vllm-ascend will support it once it's ready in vLLM. Follow the [official guide](https://docs.vllm.ai/en/v0.8.4/features/disagg_prefill.html) to use it. [#432](https://github.com/vllm-project/vllm-ascend/pull/432)
- Spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
- Structured output feature works now on the V1 Engine. Currently it only supports the xgrammar backend; using the guidance backend may produce errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
@@ -1336,7 +1336,7 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
### Highlights
- Add Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. Custom Ops compilation is disabled by default when installing vllm-ascend; set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
- V1 engine is basic supported in this release. The full support will be done in 0.8.X release. If you hit any issue or have any requirement of V1 engine. Please tell us [here](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- V1 engine is basically supported in this release; full support will be done in the 0.8.X releases. If you hit any issues or have any requirements for the V1 engine, please tell us in [this issue](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
- Prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it, as in the sketch below. [#282](https://github.com/vllm-project/vllm-ascend/pull/282)
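A minimal sketch of the prefix caching flag above, using the offline API; the model name and prompts are placeholders.

```python
# Minimal sketch: enable automatic prefix caching so prompts sharing a long
# common prefix can reuse cached KV blocks. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

shared_prefix = "You are a helpful assistant. " * 50   # long common prefix
prompts = [shared_prefix + q for q in ("What is an NPU?", "What is vLLM?")]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```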
### Core
@@ -1394,13 +1394,13 @@ We are excited to announce the first release candidate of v0.7.1 for vllm-ascend
vLLM Ascend Plugin (vllm-ascend) is a community-maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.
Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1) to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions [here](https://github.com/vllm-project/vllm-ascend/issues/19)
Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1) to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions in [this issue](https://github.com/vllm-project/vllm-ascend/issues/19)
### Highlights
- Initial support for Ascend NPU on vLLM. [#3](https://github.com/vllm-project/vllm-ascend/pull/3)
- DeepSeek is now supported. [#88](https://github.com/vllm-project/vllm-ascend/pull/88) [#68](https://github.com/vllm-project/vllm-ascend/pull/68)
- Qwen, Llama series and other popular models are also supported, you can see more details in [here](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/supported_models.html).
- Qwen, Llama series and other popular models are also supported, you can see more details in [supported_models](https://github.com/vllm-project/vllm-ascend/blob/v0.7.1rc1/docs/source/user_guide/supported_models.md).
### Core
@@ -1415,6 +1415,6 @@ Please follow the [official doc](https://docs.vllm.ai/projects/ascend/en/v0.7.1)
### Known Issues
- This release relies on an unreleased torch_npu version. It has been installed within official container image already. Please [install](https://docs.vllm.ai/projects/ascend/en/v0.7.1rc1/installation.html) it manually if you are using non-container environment.
- This release relies on an unreleased torch_npu version. It is already installed within the official container image. Please [install](https://github.com/vllm-project/vllm-ascend/blob/v0.7.1rc1/docs/source/installation.md) it manually if you are using a non-container environment.
- There are logs like `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` shown when running vllm-ascend. This doesn't affect functionality or performance, so you can safely ignore it. It has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/12432), which will be included in v0.7.3 soon.
- There are logs like `# CPU blocks: 35064, # CPU blocks: 2730` shown when running vllm-ascend, which should read `# NPU blocks:`. This doesn't affect functionality or performance, so you can safely ignore it. It has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/13378), which will be included in v0.7.3 soon.

View File

@@ -1,6 +1,6 @@
# Supported Features
The feature support principle of vLLM Ascend is: **aligned with the vLLM**. We are also actively collaborating with the community to accelerate support.
The feature support principle of vLLM Ascend is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.
Function calling: <https://docs.vllm.ai/en/latest/features/tool_calling/>
@@ -37,7 +37,7 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
[v1_user_guide]: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html
[multimodal]: https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen-VL-Dense.html
[multimodal]: https://docs.vllm.ai/projects/ascend/en/latest/tutorials/models/Qwen-VL-Dense.html
[guided_decoding]: https://github.com/vllm-project/vllm-ascend/issues/177
[LoRA]: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/lora.html
[graph_mode]: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html